In God We Trust. All Others Must Bring Data. — W. Edwards Deming. Using Word Embeddings to Recognize Idioms
                          Jing Peng, Anna Feldman
                        Department of Computer Science
                           Department of Linguistics
                           Montclair State University
         pengj@mail.montclair.edu, feldmana@mail.montclair.edu

Abstract

Expressions such as add fuel to the fire can be interpreted literally or idiomatically depending on the context they occur in. Many Natural Language Processing applications could improve their performance if idiom recognition were improved. Our approach is based on the idea that idioms violate cohesive ties in local contexts, while literal expressions do not. We propose two approaches: (1) Compute the inner product of context word vectors with the vector representing a target expression. Since literal vectors predict local contexts well, their inner product with contexts should be larger than that of idiomatic ones, thereby telling literals apart from idioms; and (2) Compute literal and idiomatic scatter (covariance) matrices from local contexts in word vector space. Since the scatter matrices represent context distributions, we can then measure the difference between the distributions using the Frobenius norm. For comparison, we implement the methods of Fazly et al. (2009), Sporleder and Li (2009), and Li and Sporleder (2010b) and apply them to our data. We provide experimental results validating the proposed techniques.

1   Introduction

Natural language is filled with emotion and implied intent, which are often not trivial to detect. One specific challenge is idioms. Figurative language draws on prior references and is unique to each culture, and sometimes what we don’t say is even more important than what we do. This, naturally, presents a significant problem for many Natural Language Processing (NLP) applications as well as for big data analytics.

Idioms are conventionalized expressions whose figurative meanings cannot be derived from the literal meaning of the phrase. There is no single agreed-upon definition of idioms that covers all members of this class (Glucksberg, 1993; Cacciari, 1993; Nunberg et al., 1994; Sag et al., 2002; Villavicencio et al., 2004; Fellbaum et al., 2006). At the same time, idioms do not form a homogeneous class that can be easily defined. Some examples of idioms are I’ll eat my hat (I’m confident), Cut it out (Stop talking/doing something), a blessing in disguise (some bad luck or misfortune results in something positive), kick the bucket (die), ring a bell (sound familiar), keep your chin up (remain cheerful), piece of cake (easy task), miss the boat (miss out on something), (to be) on the ball (be attentive/competent), put one’s foot in one’s mouth (say something one regrets), rake someone over the coals (to reprimand someone severely), under the weather (sick), a hot potato (controversial issue), an arm and a leg (expensive), at the drop of a hat (without any hesitation), barking up the wrong tree (looking in the wrong place), beat around the bush (avoiding main topic).

It turns out that expressions are often ambiguous between an idiomatic and a literal interpretation, as one can see in the examples below:¹

(A) After the last page was sent to the printer, an editor would ring a bell, walk toward the door, and holler “Good night!” (Literal)
(B) His name never fails to ring a bell among local voters. Nearly 40 years ago, Carthan was elected mayor of Tchula. . . (Idiomatic)
(C) . . . that caused the reactor to literally blow its top. About 50 tons of nuclear fuel evaporated in the explosion. . . (Literal)
(D) . . . He didn’t pound the table, he didn’t blow his top. He always kept his composure. (Idiomatic)

¹ These examples are extracted from the Corpus of Contemporary American English (COCA) (http://corpus.byu.edu/coca/).

(E) . . . coming out of the fourth turn, slid down the track, hit the inside wall and then hit the attenuator at the start of pit road. (Literal)
(F) . . . job training, research and more have hit a Republican wall. (Idiomatic)

The analysis by Fazly et al. (2009) of 60 idioms from the British National Corpus (BNC) has shown that close to half of these also have a clear literal meaning; and of those with a literal meaning, on average around 40% of their usages are literal. Therefore, idioms present great challenges for many Natural Language Processing (NLP) applications. Most current translation systems rely on large repositories of idioms. Unfortunately, more often than not, MT systems are not able to translate idiomatic expressions correctly.

In this paper we describe an algorithm for automatic classification of idiomatic and literal expressions. Similarly to Peng et al. (2014), we treat idioms as semantic outliers. Our assumption is that the context word distribution for a literal expression will be different from the distribution for an idiomatic one. We capture the distribution in terms of a covariance matrix in vector space.

2   Previous Work

Previous approaches to idiom detection can be classified into two groups: 1) type-based extraction, i.e., detecting idioms at the type level; 2) token-based detection, i.e., detecting idioms in context. Type-based extraction is based on the idea that idiomatic expressions exhibit certain linguistic properties such as non-compositionality that can distinguish them from literal expressions

cur are more affective. Still, they treat idioms as semantic outliers.

3     Our Approach

We hypothesize that words in a given text segment that are representative of the local context are likely to associate strongly with a literal expression in the segment, in terms of projection (or inner product) of word vectors onto the vector representing the literal expression. We also hypothesize that the context word distribution for a literal expression in word vector space will be different from the distribution for an idiomatic one. This hypothesis also underlies the distributional approach to meaning (Firth, 1957; Katz and Giesbrecht, 2006).

3.1    Projection Based On Local Context Representation

The local context of a literal target verb-noun construction (VNC) must be different from that of an idiomatic one. We propose to exploit recent advances in vector space representation to capture the difference between local contexts (Mikolov et al., 2013a; Mikolov et al., 2013b).

A word can be represented by a vector of fixed dimensionality q that best predicts its surrounding words in a sentence or a document (Mikolov et al., 2013a; Mikolov et al., 2013b). Given such a vector representation, our first proposal is the following. Let v and n be the vectors corresponding to the verb and noun in a target verb-noun construction, as in blow whistle, where v ∈