Modeling metaphor perception with distributional semantics vector space models

Kat R. Agres,1 Stephen McGregor,1 Karolina Rataj,2,3 Matthew Purver,1 Geraint A. Wiggins1

1 Queen Mary University of London, London, UK
{kathleen.agres,s.e.mcgregor,m.purver,geraint.wiggins}@qmul.ac.uk
2 Department of Cognitive Psychology and Ergonomics, University of Twente, Enschede, the Netherlands
3 Faculty of English, Adam Mickiewicz University, Poznań, Poland
krataj@wa.amu.edu.pl

Abstract. In this paper, we present a novel application of a computational model of word meaning to capture human judgments of the linguistic properties of metaphoricity, familiarity, and meaningfulness. We present data gathered from human subjects regarding their ratings of these properties over a set of word pairs specifically designed to exhibit varying degrees of metaphoricity. We then investigate whether these properties can be measured in terms of geometric features of a model of distributional lexical semantics. We compare the performance of two models, our own Concept Discovery Model, which dynamically constructs context-sensitive subspaces, and a state-of-the-art static distributional semantic model, and find that our dynamic model performs significantly better in its measurement of metaphoricity.

Keywords: metaphor, distributional semantics, vector space models, computational creativity

1 Introduction

In this study, we investigate whether computational models of lexical meaning might help explain human comprehension of metaphors. We examine several alternative models, based on language statistics in a large general collection of English, to see if they capture relations between words which correlate with human judgments of metaphoricity.

Psycholinguistic studies. Research on metaphor in human participants has attempted to clarify the mechanisms underlying the understanding of metaphoric language.
Copyright © 2016 for this paper by its authors. Copying permitted for private and academic purposes.

One of the earliest approaches, the standard pragmatic model, stipulates that the literal meaning of a metaphoric sentence needs to be rejected before the figurative meaning is generated [10]. Behavioral studies inspired by this model have shown that participants do not use more time to comprehend metaphoric than literal sentences, and that metaphoric meaning is generated in parallel with the literal meaning of an utterance [8, 29]. These studies, however, did not make a distinction between conventional and novel metaphors. Later reports have shown that novel metaphors do require more processing time than literal sentences, while conventional metaphor and literal language comprehension take comparable time [2].

Recently, a number of electrophysiological (EEG) studies investigating metaphor comprehension have been reported, in which the N400 component, a negative-going wave observed between 300 and 500 ms after the presentation of the critical stimulus, has received considerable attention. Larger N400 amplitudes have been observed in the processing of metaphoric as compared to literal sentences, with no differences in component latency (i.e., the time window within which the effect is observed) or scalp distribution (i.e., the sites on the scalp over which the effect is present) (e.g., [2]). This increase in amplitude has been interpreted as reflecting more activity in memory needed to retrieve the semantic information necessary for comprehension of metaphoric as compared to literal sentences [16]. At the same time, the comparable latency and scalp distribution of the component might be indicative of the involvement of similar mechanisms in literal and metaphoric language comprehension.
Interestingly, differences have been found between conventional and novel metaphors, with the N400 amplitudes for conventional metaphors falling in between those for novel metaphoric and literal utterances [2]. This graded effect has not been observed in reaction time studies, which demonstrates that ERP measures offer greater sensitivity to the time course of cognitive processes involved in metaphor comprehension. These findings raise important questions concerning the nature of the mechanisms involved in understanding metaphors.

One of the approaches that has attempted to elucidate metaphor comprehension mechanisms is the structure mapping model and its descendant, the career of metaphor model, which stipulate that the same mechanisms are involved in the comprehension of literal comparisons, similes, and metaphors [5, 30]. Within this view, comprehending metaphoric sentences, like my mind is a warehouse, requires a mapping that involves a symmetrical mechanism of alignment of relational commonalities in the source (warehouse) and target (mind), together with an asymmetrical mechanism of inference projection from the source to the target. Moreover, the career of metaphor model assumes that while novel metaphors are understood via comparison, categorization is involved in conventional metaphor understanding. These assumptions have received some support in ERP studies, which have also shown that a shared mapping process may be involved in categorization and comparison [9, 17]. Moreover, comparison seems to facilitate not only novel, but also conventional metaphor comprehension, although this facilitation is observed at later processing stages than in the case of novel metaphors.

Computational studies. Computational approaches to metaphor have generally focused on a combination of pattern matching and hand-coded information processing [25].
The KARMA model for metaphor understanding [7], for instance, attempts to encode environmentally grounded knowledge about action in the world into a framework of transferable domains. In a similar vein, ATT-Meta [3] asks users to provide domain-specific knowledge about entities and processes and then ports this knowledge between different contexts; this context-sensitive aspect of the system in particular aligns with both the theoretical work on metaphor and the computational work presented here.

Moving towards data-driven approaches, a model for metaphor comprehension has been described that applies latent semantic analysis, a statistical technique for building spaces of word similarity, to the selection of the salient features trafficked between a metaphoric source and target [15]. This technique, along with a similar method involving the selection of transferred features based on proximity in a semantic space [27], bears comparison to our dynamically contextual method as described in Section 3.2: each of these models attempts to extract particular features of a semantic space in order to capture the semantic context of a metaphor. A more recent description of a Service-Oriented Architecture [28] discovers properties of source and target by matching patterns within large-scale web corpora and then looks for properties salient in the source which can be transferred to the target. Other contemporary approaches have tended towards the more overtly statistical, with, for instance, the application of linear algebraic operations to model the metaphoric composition of vector space type word representations [11].

The computational component of the work presented here broadly falls within the paradigm of the distributional hypothesis, which holds that "words that occur in similar contexts have similar meanings" [26, p. 148].
The general methodology of distributional semantics involves the traversal of large-scale textual corpora in order to build spaces of word-vectors where the proximity of two vectors reflects the tendency for those two words to be observed co-occurring with similar terms [6]. Initial approaches to distributional semantics typically involved building up word representations based on raw co-occurrence counts [24], while more recent methodologies have often incorporated matrix factorisation techniques to derive dense matrices from co-occurrence statistics [13] or employed neural network architectures to derive word embeddings from observations of co-occurrences across iterative traversals of a corpus [4].

In the study presented here, we will in particular be comparing Word2Vec [21], a neural network driven model for generating word embeddings that has achieved state-of-the-art results on tests of word similarity and analogy completion, with our own Concept Discovery Model [19], which deploys a word-counting approach to distributional semantics to dynamically construct contextualised subspaces in which conceptual relationships play out as geometric relationships [20]. We will be examining the ways in which spaces generated by each model compare with human assessments of the degree of metaphoricity, familiarity, and meaningfulness in noun-verb word pairings. With this in mind, recent work using distributional models enriched with information from lexical and associative knowledge bases to build spaces of word-vectors constructed for detecting similarity or relatedness should also be taken into consideration [14].

2 Modeling human metaphor judgements

Our objective in this study is to explore the ways in which geometric models of word meaning can capture the perception of metaphor, and in particular can measure the degree to which two-word phrases are perceived as being metaphorical.
To do so, we compare words' relations in the geometric model with human judgments of metaphoricity, via a set of empirically derived normative data. Note that while data were collected for three types of norming measures (metaphoricity, meaningfulness, and familiarity), the principal aim of the computational work is to model the perception of metaphoricity, that is, to discover meaningful subspaces that reflect the extent to which a two-term expression is perceived as being metaphorical.

2.1 Materials

The materials were collected for an ERP study which investigated metaphor comprehension in bilinguals [12]. Verb-noun word dyads in Polish (native language) and English (second language) were used in the ERP experiment. In each case, the verb was considered the metaphoric source and the noun the target: so, for example, in the instance of the conventional metaphor "cut pollution", some salient property of the action cutting is being transferred to the entity pollution.

Prior to the ERP experiment, five normative studies were carried out to ensure the word pairs fell within the following three categories: novel metaphors (e.g., to harvest courage), conventional metaphors (e.g., to gather courage), and literal expressions (e.g., to experience courage). Based on the results of the normative studies, the final set of 228 English verb-noun word dyads (76 in each category) was selected for the purpose of the current study. Out of the five normative studies, four will be reported here. The statistical analyses consisted of mixed-design analyses of variance (ANOVAs), with utterance type as a within-subject factor and survey block as a between-subject factor. No main effect of survey block was observed. Significance values for the pairwise comparisons were corrected for multiple comparisons using the Bonferroni correction. When Mauchly's tests showed that the assumption of sphericity was violated, the Greenhouse-Geisser correction was applied.
In such cases, the original degrees of freedom are reported with the corrected p value. The demographic data for the participants of the four normative studies are presented in Table 1.

Table 1. Demographic characteristics of participants of the four normative studies, including the number of participants (number of female participants) and mean age.

Normative study type     Number of participants (female)   Mean age
Cloze probability        140 (65)                          23
Meaningfulness ratings   133 (61)                          22
Familiarity ratings      101 (55)                          23
Metaphoricity ratings    102 (59)                          22

Cloze probability. Because reduced N400 amplitudes have been observed in relation to expected as compared to unexpected words, a cloze probability test was performed prior to the ERP study to ensure that the second word in a given word dyad was not highly anticipated by the participants of the ERP experiment. Each participant of the cloze probability test received the first word of a given word pair, and was asked to provide the second word, so that the two words would make a meaningful expression. Due to the length of the test, all word pairs were divided into four blocks, so that each word was completed by 35 participants. If a given word pair was observed in the cloze probability test more than 3 times, the word pair was excluded from the final set and replaced with a new one. This procedure was repeated until the cloze probability for word pairs in all categories did not exceed 8%.

Meaningfulness. In order to assess the meaningfulness of the stimuli, participants were asked to rate how meaningful a given word pair was on a scale from 1 (totally meaningless) to 7 (totally meaningful). The set of 228 word dyads was divided into four survey blocks in order to avoid the repetition of the target word within the same survey. Additionally, 76 meaningless word pairs were included in this normative study. The results revealed a main effect of utterance type, [F(3, 387) = 1611.54, p < .001, ε = .799, ηp² = .93].
Pairwise comparisons revealed that literal word pairs were assessed as more meaningful (M = 5.99, SE = .05) than conventional metaphors (M = 5.17, SE = .06) (p < .001), and conventional metaphors were assessed as more meaningful than novel metaphors (M = 4.09, SE = .08) (p < .001).

Familiarity. The familiarity of each word pair was assessed in another normative study. Participants were asked to decide how often they had encountered the presented word pairs on a scale from 1 (very rarely) to 7 (very frequently). The set of 228 word dyads was divided into three survey blocks in order to avoid the repetition of the target word within the same survey. Again, a main effect of utterance type was found, [F(2, 296) = 470.97, p < .001, ε = .801, ηp² = .83]. Pairwise comparisons showed that novel metaphors (M = 2.15, SE = .07) were rated as less familiar than conventional metaphors (M = 2.97, SE = .08) (p < .001), with literal expressions being the most familiar (M = 3.85, SE = .09) (p < .001). Furthermore, conventional metaphors were less familiar than literal word dyads (p < .001). It is crucial to note that although differences were observed between categories, all word pairs were relatively unfamiliar. This is visible in the mean score for literal word pairs, which, while the most familiar of the three categories, is still relatively low (below 4 on a scale where 6 and 7 represent very familiar items). The reason familiarity was low in all three categories is the same as for the cloze probability test: we intentionally excluded highly probable combinations.

Metaphoricity. In order to assess the metaphoricity of the word pairs, participants were asked to decide how metaphoric a given word dyad was on a scale from 1 (very literal) to 7 (very metaphoric). The set of 228 word dyads was again divided into three survey blocks in order to avoid the repetition of the target word within the same survey.
The results revealed a main effect of utterance type, [F(2, 198) = 588.82, p < .001, ε = .738, ηp² = .86]. Pairwise comparisons confirmed that novel metaphors (M = 5.00, SE = .06) were rated as more metaphoric than conventional metaphors (M = 3.98, SE = .06) (p < .001), and conventional metaphors were rated as more metaphoric than literal utterances (M = 2.74, SE = .07) (p < .001).

3 Computational Modeling Method

In order to computationally model human judgment of the conceptual features of word dyads, we construct distributional semantic spaces where the proximity of word-vectors relates to their semantic similarity, and then explore the geometry of these spaces for ways of mapping relationships between words that are productive with regard to such conceptual, cognitive phenomena as metaphor. Specifically, we compare two different distributional semantic models to assess the difference in performance between a model that might be described as static, such as the one outlined in Section 3.1, and one that is contextually dynamic, as is the intent with our own model as explained in Section 3.2.

For both our static and dynamic models, we train vectors on the English language version of Wikipedia. For the purpose of capturing word co-occurrences, we focus only on the descriptive content of Wikipedia pages, ignoring headers, lists, captions, and the like. Considering only sentences at least five words in length, we strip the corpus of punctuation, remove articles (the, a, and an), and remove parenthetical phrases, resulting in an overall corpus of approximately 7.5 million word types and 1.1 billion word tokens. For the construction of both models, we consider context windows of five words on either side of a target word, treating sentence endings as contextual boundaries as well. We take the 200,000 most frequently occurring words in the corpus as the vocabulary for both models, constructing one word-vector for each word in the vocabulary.
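As a rough illustration, the preprocessing just described might be implemented along the following lines. This is a sketch only: the function names are our own, and the decision to apply the five-word length filter after cleaning (rather than before) is an assumption, not a description of our actual pipeline.

```python
import re

ARTICLES = {"the", "a", "an"}  # articles removed during preprocessing

def preprocess_sentence(sentence):
    """Clean one sentence: drop parenthetical phrases, strip punctuation,
    remove articles. Returns a token list, or None if the cleaned
    sentence is shorter than five words (an assumed ordering)."""
    sentence = re.sub(r"\([^)]*\)", " ", sentence)       # parentheticals
    sentence = re.sub(r"[^\w\s]", " ", sentence).lower()  # punctuation
    tokens = [t for t in sentence.split() if t not in ARTICLES]
    return tokens if len(tokens) >= 5 else None

def context_pairs(tokens, window=5):
    """Yield (target, context) pairs within a five-word window on either
    side of the target, treating the sentence boundary as a hard limit."""
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield (w, tokens[j])
```

Counting the pairs yielded by `context_pairs` over the whole corpus gives the co-occurrence frequencies from which both models are built.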
As our measure of semantic relatedness between two words, we take the cosine similarity between their corresponding word-vectors, in line with a number of other contemporary distributional semantic models [18, 23]. It should be noted that in the case of a normalised distributional semantic space, such as that described in Section 3.1, relations based on cosine similarity are equivalent to those based on Euclidean distance.

3.1 Word2Vec

As our primary point of comparison in this study, we use the Word2Vec distributional semantic model [22]. This model has achieved state-of-the-art results on analogy completion tasks in particular, and has generally received widespread attention within the field of computational linguistics. A critical feature of the model is its deployment of a neural network to build a space of word-vectors. One result of this process is that the model's dimensionality cannot be interpreted: Word2Vec treats a dimension as an arbitrary handle for pulling word-vectors into the desired relationship based on observations of co-occurrences in training data. Therefore, in comparison to our model described in Section 3.2, it is not possible to project dimensionally contextualised subspaces from a Word2Vec type model in a direct manner (while perhaps a separate neural network could be designed and trained specifically to perform this projection, this is beyond the scope of this paper). Two different network architectures have been reported in the literature; here, we employ the Skip-gram architecture, consisting of a two-layer neural network which learns to predict context terms based on an input word, as this approach has been reported as performing particularly well on semantically oriented tasks [21]. The model takes the form of a set of word-vectors arrayed across the surface of a hypersphere. Here, we build a 300-dimensional space based on 10 passes over the corpus described above, with a negative sampling rate of 10.
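The cosine measure, and its equivalence with Euclidean distance over unit-length vectors (such as those arrayed on Word2Vec's hypersphere), can be made concrete in a few lines. This is an illustrative sketch rather than our model code:

```python
import math

def cosine(u, v):
    """Cosine similarity between two word-vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def normalise(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

# For unit-length vectors, squared Euclidean distance and cosine
# similarity carry the same information: ||u - v||^2 = 2 - 2*cos(u, v),
# so rankings by either measure agree.
u, v = normalise([1.0, 2.0, 3.0]), normalise([2.0, 1.0, 0.5])
sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
assert abs(sq_dist - (2.0 - 2.0 * cosine(u, v))) < 1e-12
```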
To assess the model's ability to capture the human metaphoricity judgment data, we then measure the cosine similarity between the word-vectors for each word in each word pair from the study described in Section 2.

3.2 Conceptual Discovery Model

We compare the performance of the established Word2Vec semantic model with the output of a distributional semantic model which dynamically interprets input text to project context-sensitive subspaces from a sparse, high-dimensional base space. As described in detail elsewhere [1, 19], the Conceptual Discovery Model builds a base space populated by what might be described as literal statistical data about word co-occurrences as observed in a large-scale corpus: each dimension in the space corresponds directly to a co-occurrence term, and no matrix factorisation or other dimensional reduction technique is applied to this base space. Rather, each dimension c of a word-vector w⃗ is populated with a pointwise mutual information (PMI) score based on the following equation, where n_{w,c} represents the frequency at which word w is observed occurring within 5 words of word c, n_w is the independent frequency of w, n_c is the independent frequency of c, W is the total word count, and a is a smoothing constant:

    w⃗_c = log₂( (n_{w,c} × W) / (n_w × (n_c + a)) + 1 )    (1)

We build a base space of roughly 7.5 million dimensions, corresponding to the number of word types in Wikipedia. From this base space, we dynamically pick 200-dimensional subspaces, specific to each word pair in the study. Three different methodologies for projecting subspaces will be discussed below, but in each case, the input used to determine the projection is simply the pair of words involved in a potentially metaphoric dyad, and the projection is based on an analysis of the respective values of these inputs along any given dimension.
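The weighting in Equation 1 is straightforward to compute from co-occurrence counts, and ranking the dimensions with the highest PMI values for a word is the basic operation behind the subspace selection techniques discussed below. The sketch that follows uses our own illustrative names (`pmi_weight`, `top_dimensions`), not identifiers from the implemented model:

```python
import math

def pmi_weight(n_wc, n_w, n_c, W, a=1.0):
    """Smoothed PMI weight for dimension c of word w, per Equation 1:
    log2( (n_wc * W) / (n_w * (n_c + a)) + 1 )."""
    return math.log2((n_wc * W) / (n_w * (n_c + a)) + 1.0)

def top_dimensions(word_vector, k=200):
    """Given a sparse word-vector as a dict mapping co-occurrence
    term -> PMI weight, return the k dimensions with the highest weights."""
    return sorted(word_vector, key=word_vector.get, reverse=True)[:k]
```

For instance, with a = 0, a word observed with a context term ten times more often than chance receives a weight of log₂(10 + 1) ≈ 3.46; the additive 1 inside the logarithm keeps the weight non-negative even for words that co-occur less often than chance.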
The intuition behind this methodology is that subspaces consisting of dimensions which are mutually salient for both components of a dyad will capture something of the semantic context in which the candidate metaphor might be meaningful. Within these subspaces, as with Word2Vec, we assess the relationship between the words in a word pair in terms of the cosine similarity between their two corresponding word-vectors. One of the primary considerations in the application of this model is therefore the method for selecting these subspaces.

Discovering conceptually relevant spaces. We experimented with three different techniques for choosing subspaces from our base space, in each case focusing on the relationship between the target and source word in each dyad.

(a) Noun-Only Subspaces   (b) Joint Subspaces   (c) Independent Subspaces

Fig. 1. Here three different types of subspaces are presented, with three expressions involving the word "pollution" superimposed on each two-dimensional projection: the literal phrase "cause pollution", the conventional metaphor "cut pollution", and the novel metaphor "pierce pollution". Angles between the vectors for both words in each pair are measured for correlation with human judgments of the metaphoricity of each expression. Angles and vector lengths from the 200-dimensional subspaces we analysed are preserved in these projections. The word-vector for pollution is the same for all three versions of the noun-only space, since the other terms have no influence on the selection of dimensions here.

– Noun-Only Subspaces: the subspace is selected based only on associations with the target term: we take the 200 dimensions with the highest PMI value (as expressed in Equation 1) for the target (i.e., noun) in a given dyad.
– Joint Subspaces: selection is based on associations shared by the source and target terms: we select the 200 dimensions with the highest average PMI for both target and source terms in each dyad.
– Independent Subspaces: selection is based on independent associations with the source term and the target term, such that we select the 100 terms with the highest PMI values for the source term and the target term independently, and then merge these two sets of dimensions into a single 200-dimensional space.

An example of how target and source dyads manifest in these three subspaces is shown in Fig. 1.

4 Results and Discussion

For both Word2Vec and CDM, cosine similarity values were computed for each word pair used in the behavioral study described above. Because the human ratings are the ground truth in this instance, Cosine Similarity is the dependent variable in each of the multiple regressions reported below. The three measures provided by human raters (Metaphoricity, Meaningfulness, and Familiarity) are the independent variables used in the analyses. The general aim is to identify which aspects of the word pairs (in terms of perceived metaphoricity, etc.) are captured by cosine similarity in a given space. We also explore which type of subspace is best able to capture metaphor alone (that is, which space accounts for the most variability in human responses for metaphoricity). We first report the results for Word2Vec, which are then used as a baseline against which to compare the results of our CDM model. The results for CDM are broken down by the type of underlying subspace.

4.1 Word2Vec results

The results of the multiple regression analysis for Word2Vec indicated that the predictors accounted for a significant proportion of the variance in Cosine Similarity scores [R² = .249, F(3, 224) = 24.81, p < .001].
Metaphoricity significantly predicted Cosine Similarity scores [β = −0.25, t(224) = −3.09, p < .01], as did Familiarity [β = 0.22, t(224) = 2.65, p < .01]. Low values of Metaphoricity tend to yield high values of Cosine Similarity, and low values of Familiarity tend to yield low values of Cosine Similarity. Meaningfulness was not a significant predictor in the regression.

First, these results confirm that Cosine Similarity does, in fact, capture more information than simply similarity about a given pair of terms: both Metaphoricity and Familiarity help to account for the variance in Cosine Similarity values for Word2Vec. Second, these results provide a standard by which we are able to compare the Conceptual Discovery Model's performance.

4.2 CDM results

Unlike Word2Vec, the CDM model affords the discovery of different kinds of geometrically defined subspaces. The crucial advantage of the CDM model is its ability to project a context-specific subspace geared towards capturing the semantics of situations in which a metaphor can be meaningfully applied. As such, our objective is to compare the performance of Cosine Similarity scores for detecting properties of metaphors using different techniques for constructing context-specific subspaces, in particular the Noun-only, Joint, and Independent methods described in Section 3.2. The same multiple regression analysis as above was performed for these three CDM model configurations. Finally, the relationship between Cosine Similarity and Metaphoricity ratings is explored in more depth for the best-performing model.

Noun-only subspaces and Joint subspaces. The regression analysis for the Noun-only subspaces indicates that the predictors account for a limited proportion of the variance in Cosine Similarity scores [R² = .108, F(3, 224) = 9.04, p < .01].
Metaphoricity significantly predicts Cosine Similarity scores [β = −0.30, t(224) = −3.48, p < .01], where higher Metaphoricity ratings are associated with lower values of cosine similarity. The regression analysis for the Joint subspace does not yield any significant results, with R² = .016, F(3, 224) = 1.19, p = n.s.

Given the poor performance of these models compared with Word2Vec (quantified in their low R² values), we conclude that neither the Noun-only subspace nor the Joint subspace is suitable for capturing the perceived metaphoricity of word pairs. Previous computational approaches to metaphor generation and interpretation [28, 31] have highlighted the fact that successful metaphors often result from situations in which the salient properties of one term (e.g., the target) are distinct from the salient properties of the other term. The results may therefore indicate that these two types of subspaces do not capture the imbalance of salient properties between terms necessary to reflect metaphorical language. In other words, the Noun-only and Joint methods of delineating subspaces do not seem to select dimensions that are more salient for one term than the other, suggesting that Independent subspaces may provide a better way of capturing this information.

Independent subspaces. The results of the multiple regression analysis for the Independently constructed subspaces indicate that Metaphoricity accounts for a significant proportion of the variance in Cosine Similarity scores [R² = .271, F(3, 224) = 27.73, p < .001], and Metaphoricity significantly predicts Cosine Similarity [β = −0.37, t(224) = −4.72, p < .001]. Of the three multiple regressions reported here, this analysis accounts for the most variability in cosine similarity values, with an R² of .271, which is significantly higher than the regression for Word2Vec (where the R² was .249).
Note that, interestingly, Familiarity is not significant in this analysis. The interpretation of this finding is discussed below in the General Discussion.

To visualize how the relationship between Cosine Similarity and Metaphoricity varies by utterance type (Conventional metaphor, Novel metaphor, and Literal word pair), a correlation analysis is shown in Fig. 2 for this best-performing model, with utterance types demarcated. Metaphoricity is inversely correlated with Cosine Similarity [r = −.50, t = 32.49, p < .001], such that word pairs rated as highly metaphorical tend to have low Cosine Similarity values. Novel metaphor word pairs, which participants rated highest for Metaphoricity, generally have low Cosine Similarity scores. This trend is shared by the Conventional metaphor word pairs, although their Metaphoricity scores tend to be slightly lower (this is confirmed by examining the averages of these two utterance types). Finally, Literal word pairs, which garnered the lowest ratings for Metaphoricity, tend to have slightly higher Cosine Similarity values overall.

Fig. 2. Correlation between Cosine Similarity and Metaphoricity, including visualisation of Pair Types.

In sum, whereas Cosine Similarity in Word2Vec is correlated with both Metaphoricity and Familiarity, the flexibility of our CDM model (specifically, the ability of our model to discover specific, conceptually relevant spaces) allows us to discover a space in which Cosine Similarity reflects only the metaphorical aspects of word pairs.

5 General Discussion

The set of results for Word2Vec and CDM offers important insight for the computational simulation of metaphoric language use. Firstly, for both Word2Vec and the Independent subspaces version of the CDM model, human ratings of Metaphoricity were able to account for a significant proportion of the variability in Cosine Similarity scores.
Although only 24-27% of the variance was explained, it is important to consider that utterance type (which significantly influenced ratings) was not included in the statistical analyses above, because 1) the present research investigates the extent to which cosine similarity (alone) accounts for perceived metaphoricity between two terms, and 2) information about utterance type would not be available when applying our model in other contexts.

The performance of Word2Vec was used as a standard with which to compare the three variants of our CDM model. Both Familiarity and Metaphoricity were significant predictors of Cosine Similarity for Word2Vec, but for CDM, we were able to find a subspace that captures solely the perceived Metaphoricity of word pairs (because Metaphoricity was the only significant effect in the regression analysis). Furthermore, this Independent subspaces model performed best out of all of the models tested here, with an R² of .271 (while Word2Vec had an R² of only .249). Although this is only a modest improvement over Word2Vec, the difference does suggest that our CDM has a greater capacity to capture perceived metaphoricity. We therefore conclude that the Independent subspace method offers both the most accurate and the most direct model of Metaphoricity (without confounding effects from Familiarity or Meaningfulness).

It is interesting to consider the comparative performance of the differently configured CDM models, and explore why the Independent subspaces technique results in by far the best subspaces for mapping human judgments of metaphoricity to cosine similarity between word-vectors. In as much as a "property theoretic view of metaphor" [28, p. 56] has been laid out in computational terms, the expectation is that a successful computational model of metaphor will capture the way in which salient properties of a source are mapped to specific instances of a target.
The subspaces generated by our model are intended to represent a conceptually relevant contextualisation of a lexical space: the dimensions of these subspaces consist of sets of words which, taken independently, offer only anecdotal glimpses into the way language is used, but which collectively can be understood as a certain way of speaking about a conceptually coherent topic. So in a metaphorically relevant subspace, we hope to discover a de facto overlapping of some but not all of the properties of source and target. Rather than discover spaces where the mutual properties of two conceptual domains are already to some extent emphasised – as is the case with our Joint subspaces – we seek spaces where only a degree of overlap between the salient properties of each conceptual domain can be found, and where precisely this feature of a space is significant. This explains the efficacy of our Noun-only and, even more so, our Independent subspaces in mapping human judgments of metaphor. In the case of the Noun-only subspace, we establish a context emphasizing the salient properties of the noun; to the extent that a verb expresses a conceptually paradigmatic action in this context, the cosine similarity between noun and verb word-vectors will be high, becoming lower as the noun-verb relationship becomes more metaphorical in nature. This phenomenon is considerably more evident in our Independent subspaces, where cosine similarity shows an inverse correspondence to the salient properties of each component of the dyad which have been merged into a single hybrid context.

It is worth noting that distributional semantic models have typically been applied to tasks involving the identification of word similarity, with the underlying intuition regarding these spaces being that similar words occur in similar contexts.
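The idea of measuring similarity within a contextualised subspace can be sketched in a much-simplified form. This is not the CDM's actual subspace-discovery procedure; here the "subspace" is simply a hand-picked subset of the dimensions of a toy space, and the word-vectors are random stand-ins.

```python
# Illustrative sketch only: cosine similarity restricted to a subspace,
# modelled here as a chosen subset of the dimensions of the full space.
# The CDM selects conceptually relevant dimensions; this sketch does not.
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word-vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def subspace_cosine(u, v, dims):
    """Cosine similarity computed only over the selected dimensions."""
    return cosine(u[dims], v[dims])

rng = np.random.default_rng(0)
noun, verb = rng.random(50), rng.random(50)  # stand-ins for word-vectors

full_score = cosine(noun, verb)
# A hypothetical "conceptually relevant" subspace: the first ten dimensions.
sub_score = subspace_cosine(noun, verb, np.arange(10))
print(full_score, sub_score)  # the two measures generally diverge
```

The point of the sketch is that the same word pair can score quite differently depending on which contextualisation of the space the comparison is made in.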
Similarity here must be understood in a different light than the familiarity inherent in a word pairing: we might expect familiarity to correlate roughly with a tendency towards juxtaposition, and so a statistical measure of familiarity might emerge simply from calculating the PMI between two co-occurring words. Nonetheless, we must also note that words that tend to occur together will necessarily also tend to occur together in the same context, and so we might expect familiarity to emerge as a kind of artifact of this tendency in spaces geared towards similarity. It is therefore not surprising that a fairly standard distributional semantic model such as Word2Vec captures a degree of familiarity in measures of cosine similarity.

With this in mind, we might imagine a way forward towards building more nuanced subspaces particularly geared to prise apart judgments of metaphoricity. We could, for instance, investigate techniques for building subspaces that focus primarily on the source component of a word pair – the verb, in the cases studied here – in order to draw out the salient properties of the source and then measure the degree to which these properties are typically transferable to a target.

Finally, in future work we hope to use our findings regarding the geometric properties of subspaces to discover how people are likely to interpret new word pairs. In contrast to the research presented above, where human ratings were used to explain the variance in cosine similarity scores, this future direction will use cosine similarity scores for novel dyads to predict the degree to which human participants will perceive a given dyad as being metaphorical.

Acknowledgments

This research is supported by the project ConCreTe, which acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET grant number 611733.
This research has also been supported by EPSRC grant EP/L50483X/1.

References

1. Agres, K., McGregor, S., Purver, M., Wiggins, G.: Conceptualising creativity: From distributional semantics to conceptual spaces. In: Proceedings of the 6th International Conference on Computational Creativity. Park City, UT (2015)
2. Arzouan, Y., Goldstein, A., Faust, M.: Brainwaves are stethoscopes: ERP correlates of novel metaphor comprehension. Brain Research 1160, 69–81 (2007)
3. Barnden, J.: Uncertainty and conflict handling in the ATT-Meta context-based system for metaphorical reasoning. In: Third International Conference on Modeling and Using Context. pp. 15–29 (2001)
4. Baroni, M., Dinu, G., Kruszewski, G.: Don't count, predict! In: ACL 2014 (2014)
5. Bowdle, B.F., Gentner, D.: The career of metaphor. Psychological Review 112(1), 193 (2005)
6. Clark, S.: Vector space models of lexical meaning. In: Lappin, S., Fox, C. (eds.) The Handbook of Contemporary Semantic Theory. Wiley-Blackwell (2015)
7. Feldman, J., Narayanan, S.: Embodied meaning in a neural theory of language. Brain and Language 84, 385–392 (2004)
8. Gibbs, R.W., Bogdanovich, J.M., Sykes, J.R., Barr, D.J.: Metaphor in idiom comprehension. Journal of Memory and Language 37(2), 141–154 (1997)
9. Goldstein, A., Arzouan, Y., Faust, M.: Killing a novel metaphor and reviving a dead one: ERP correlates of metaphor conventionalization. Brain and Language 123(2), 137–142 (2012)
10. Grice, H.P.: Logic and conversation. In: Cole, P., Morgan, J.L. (eds.) Syntax and Semantics, pp. 41–58. Academic Press, New York (1975)
11. Gutiérrez, E.D., Shutova, E., Marghetis, T., Bergen, B.K.: Literal and metaphorical senses in compositional distributional semantic models. In: Proceedings of the 54th Meeting of the Association for Computational Linguistics (2016, to appear)
12. Jankowiak, K., Naskrecki, R., Rataj, K.: Event-related potentials of bilingual figurative language processing.
In: Poster presented at the 19th Conference of the European Society for Cognitive Psychology. Paphos, Cyprus (2015)
13. Kiela, D., Clark, S.: A systematic study of semantic vector space model parameters. In: Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC) @ EACL 2014. pp. 21–30. Gothenburg (2014)
14. Kiela, D., Hill, F., Clark, S.: Specializing word embeddings for similarity or relatedness. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pp. 2044–2048 (2015)
15. Kintsch, W., Bowles, A.R.: Metaphor comprehension: What makes a metaphor difficult to understand? Metaphor and Symbol 17(4), 249–262 (2002)
16. Kutas, M., Federmeier, K.D.: Thirty years and counting: Finding meaning in the N400 component of the event related brain potential (ERP). Annual Review of Psychology 62, 621 (2011)
17. Lai, V.T., Curran, T.: ERP evidence for conceptual mappings and comparison processes during the comprehension of conventional and novel metaphors. Brain and Language 127(3), 484–496 (2013)
18. Levy, O., Goldberg, Y.: Linguistic regularities in sparse and explicit word representations. In: 18th Conf. on Computational Natural Language Learning (2014)
19. McGregor, S., Agres, K., Purver, M., Wiggins, G.: From distributional semantics to conceptual spaces: A novel computational method for concept creation. Journal of Artificial General Intelligence (2015)
20. McGregor, S., Purver, M., Wiggins, G.: Words, concepts, and the geometry of analogy. In: Workshop on Semantic Spaces at the Intersection of NLP, Physics and Cognitive Science (2016)
21. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of ICLR Workshop (2013)
22. Mikolov, T., Yih, W.t., Zweig, G.: Linguistic regularities in continuous space word representations.
In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 246–251 (2013)
23. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Conf. on Empirical Methods in Natural Language Processing (2014)
24. Schütze, H.: Dimensions of meaning. In: Proceedings of the 1992 ACM/IEEE Conference on Supercomputing. pp. 787–796 (1992)
25. Shutova, E., Teufel, S., Korhonen, A.: Statistical metaphor processing. Computational Linguistics 39(2), 301–353 (2012)
26. Turney, P.D., Pantel, P.: From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37, 141–188 (2010)
27. Utsumi, A.: Computational exploration of metaphor comprehension processes using a semantic space model. Cognitive Science 35(2), 251–296 (2011), http://dx.doi.org/10.1111/j.1551-6709.2010.01144.x
28. Veale, T.: A service-oriented architecture for metaphor processing. In: Proceedings of the Second Workshop on Metaphor in NLP. pp. 52–60 (2014)
29. Wolff, P., Gentner, D.: Evidence for role-neutral initial processing of metaphors. Jnl. Experimental Psychology: Learning, Memory, and Cognition 26(2), 529 (2000)
30. Wolff, P., Gentner, D.: Structure-mapping in metaphor comprehension. Cognitive Science 35(8), 1456–1488 (2011)
31. Xiao, P., Alnajjar, K., Granroth-Wilding, M., Agres, K., Toivonen, H.: Meta4meaning: Automatic metaphor interpretation using corpus-derived word associations. In: Proceedings of the 7th International Conference on Computational Creativity (ICCC). Paris, France (2016)