Distributional Analysis of Verbal Neologisms: Task Definition and Dataset Construction

Matteo Amore
University of Pavia / Pavia, Italy
CELI Language Technology / Turin, Italy
matteo.amore01@universitadipavia.it

Stephen McGregor
LATTICE - CNRS & École normale supérieure / Montrouge, France
Université Sorbonne nouvelle Paris 3 / Paris, France
semcgregor@hotmail.com

Elisabetta Jezek
University of Pavia, Department of Humanities / Pavia, Italy
jezek@unipv.it

Abstract

English. In this paper we introduce the task of interpreting verbal neologisms (VNeo) for the Italian language, making use of a highly context-sensitive distributional semantic model (DSM). The task is commonly performed manually by lexicographers, who verify the contexts in which the VNeo appear. Developing such a task is likely to be of use from a cognitive, social and linguistic perspective. In the following, we first outline the motivation for our study and our goal, then focus on the construction of the dataset and the definition of the task.

Italiano. In this contribution we introduce a task of interpreting verbal neologisms (VNeo) in Italian, using a highly context-sensitive distributional semantic model. This activity is commonly carried out manually by lexicographers, who check the contexts in which the VNeo appears. Developing this kind of task can prove useful from a linguistic, cognitive and social perspective. In what follows we first present the motivation and goals of the analysis, and then concentrate on the construction of the dataset and on the definition of the task.

1 Introduction: motivation and goals

Studying neologisms can tell us several things. From a lexicographic point of view, neologisms can show trends that a language is following. In our opinion, they can also shed light on various aspects of linguistic creativity: when speakers use new words (coined by themselves, or recently coined by someone else), they expect that the hearer will understand what they have just said.[1] Reversing the perspective, from the point of view of the hearers, when they encounter a word for the first time they are generally capable of making hypotheses about its meaning. The process of understanding unknown words involves the employment of previously acquired information. This knowledge can come from various sources: experience of the world, education, and contextual elements;[2] in this contribution we focus on linguistic contextual information, namely co-occurrence.

For computational linguistics, neologisms raise some intriguing issues: automatic detection (especially for languages which do not separate written words with blank spaces); lemmatisation; POS tagging; semantic analysis; and so forth.

In this paper we present the task we have developed in order to interpret neologisms, using the context-sensitive DSM described by McGregor et al. (2015). This model was built to represent concepts in a spatial configuration, making use of a computational technique that creates conceptual subspaces. With the help of this DSM we intend to analyse the behaviour of a sub-group of neologisms, namely verbal neologisms (see Amore, 2017 for more background).

Our goal is primarily linguistic. We intend to investigate the interpretation of VNeo, measuring the semantic salience of candidate synonyms by way of geometries indicated by an analysis of co-occurrence observations of VNeos. For instance, we expect that the VNeo googlare 'to google' and a verb like cercare 'to search' are geometrically related in a subspace specific to the conceptual context of the neologism.
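As a minimal sketch of the kind of geometric relatedness we have in mind (the toy vectors below are invented and do not come from any of the models described in the following sections), one can check whether a vector for googlare lies closer, under cosine similarity, to cercare 'to search' than to an unrelated verb such as mangiare 'to eat':

    from math import sqrt

    # Invented toy vectors standing in for rows of a co-occurrence model;
    # the actual representations are described in the following sections.
    vectors = {
        "googlare": [0.9, 0.1, 0.7, 0.0],
        "cercare":  [0.8, 0.2, 0.6, 0.1],
        "mangiare": [0.0, 0.9, 0.1, 0.8],
    }

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

    for candidate in ("cercare", "mangiare"):
        sim = cosine(vectors["googlare"], vectors[candidate])
        print(f"googlare ~ {candidate}: {sim:.3f}")
    # The expectation stated above is that the first value is clearly higher.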
The interpretation of neologisms presents two main challenges: a) analysing verbs using vectors built only upon co-occurrences (thus excluding argument structures) is notoriously difficult for DSMs;[3] b) neologisms are, by definition, words whose frequency is (very) low, because their use is (still) not widespread. This is a challenge for distributional models precisely because the vectors for most VNeo will rely upon few occurrences. In order to evaluate our results, we will compare them with those obtained using the Word2Vec model (Mikolov et al., 2013a), and with a gold standard consisting of human judgments on semantic relatedness (synonymy). The paper is structured as follows. In Section 2 we introduce the DSM that we employ in our task; in Section 3 we describe the construction of the VNeo dataset and the problems we encountered. Finally, in Section 4 we outline the task and present some preliminary thoughts on expected results.

2 Distributional Semantic Modelling

DSM is a technique for building up measurable, computationally tractable lexical semantic representations based on observations of the way that words co-occur with one another across large-scale corpora. This methodology is grounded in the distributional hypothesis, which maintains that words observed to have similar co-occurrence profiles are likely to be semantically related (Harris, 1954; Sahlgren, 2008).
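As a minimal illustration of the kind of co-occurrence observation this hypothesis relies on (a sketch only, not the implementation underlying the model described below), the snippet counts, for each lemma in a toy lemmatised corpus, the words appearing within a symmetric window; such count profiles are the raw material from which distributional representations are built:

    from collections import Counter, defaultdict

    # Toy lemmatised "corpus": one sentence per list; a real corpus would be
    # streamed from disk (cf. the itTenTen16 sample described in Section 3).
    corpus = [
        ["io", "googlare", "il", "nome", "di", "il", "ristorante"],
        ["lui", "cercare", "il", "nome", "di", "il", "sito"],
    ]

    WINDOW = 5  # symmetric window of 5 words on either side, as in Section 3.2

    cooc = defaultdict(Counter)
    for sentence in corpus:
        for i, target in enumerate(sentence):
            lo, hi = max(0, i - WINDOW), min(len(sentence), i + WINDOW + 1)
            for j in range(lo, hi):
                if j != i:
                    cooc[target][sentence[j]] += 1

    print(cooc["googlare"])   # co-occurrence profile of the target lemma
    print(cooc["cercare"])    # a distributionally similar verb shares much of it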
In general, a DSM consists of a high-dimensional vector space in which words correspond to vectors, and the geometric relationship between vectors is expected to indicate something about the semantic relationship between the associated words. The relationship most typically modelled is general semantic relatedness, as opposed to more precise indications of, for instance, similarity (Hill et al., 2015), but distributional semantic models have been effectively applied to tasks ranging from language modelling (Bengio, 2009) to metaphor classification (Gutiérrez et al., 2016) and the extrapolation of more fine-grained intensional correspondences between concepts (Derrac and Schockaert, 2015).

Standard DSM techniques present two problems for the task of interpreting neologisms. First, distributional representations are predicated on many observations of a word across a large-scale corpus: it is the plurality of contexts which gives these representations their semantic nuance. Second, the spaces generated by standard approaches such as matrix factorisation and neural networks are abstract, in the sense that their dimensions are not interpretable; as such, typical distributional semantic models are not sensitive to the context-specific way in which meaning arises in the course of language use. McGregor et al. (2015) have proposed a context-sensitive approach to distributional semantic modelling that seeks to overcome this second problem by using contextual information to project semantic representations into lower-dimensional conceptual perspectives in an on-line way.

This methodology entails the selection of sets of dimensions, from a base space of co-occurrence statistics, that are in some sense conceptually salient to the context being modelled. The selection of salient features facilitates the projection of subspaces in which the geometric situation of, and relationships between, word-vectors are expected to map to a specific conceptual context. This technique has been applied to tasks involving context-sensitive semantic phenomena such as metaphor rating (Agres et al., 2016), analogy completion (McGregor et al., 2016), and the classification of semantic type coercion (McGregor et al., 2017).

With regard to the first problem, data sparsity, we propose that the facility of the dynamically contextual approach for handling the ad hoc emergence of concepts (Barsalou, 1993) should provide a way of mapping from relatively few observations of neologisms, possibly taken outside the data used to build the underlying model, to context-specific perspectives on distributional semantic representations.

3 Verbal Neologisms: dataset, corpus and lemmatisation

We will now explain the methodology we use in our analysis and describe the resources we exploit, highlighting their main features.

3.1 Sources for the neologisms list

To select the VNeo to be analysed, we extract data from pre-existing lists of Italian neologisms. These lists come from three websites: a) treccani.it;[4] b) iliesi.cnr.it/ONLI/;[5] c) accademiadellacrusca.it.[6] (a) and (b) are manually compiled and validated: they contain words manually found in some widely read newspapers but not (yet) included in Italian dictionaries, coherently with the lexicographical definition of neologisms (cf. Adamo and Della Valle, 2017). (c) consists of a list of words that, according to the users of the website, should be included in dictionaries. There is no curating of these suggestions (except the removal of swearwords); thus some of these neologisms might already be included in dictionaries. We chose to use this list because it allows us to analyse words which are perceived as new by a community of Italian speakers. In this way we intend to highlight the perspective of the hearers encountering new words.

Within the lists, we select only the verbs, obtaining a set of 504 VNeo. Of these, we check their presence in the itTenTen16 corpus, which we will also use to create the distributional vector space. 340 VNeo are attested in the corpus: 108 have between 10 and 99 occurrences; 79 between 100 and 999 occurrences; and 26 have more than 1,000 occurrences.

Instead of using heuristic techniques that might have identified neologisms within the corpus (e.g. computing less frequent words and manually checking their presence in dictionaries),[7] we chose to rely on lists because we intend to study words whose use is wider and not restricted only to the web domain.

It is worth noting that we create vectors starting from lemmas (not tokens). Our analysis highlighted some inaccuracies in the automatic lemmatisation of neologisms,[8] which were already present in the original corpus.[9] In a future investigation we are planning to compare the results produced with the original lemmatised corpus against the results obtained from a corpus version in which the lemmatisation has been corrected. This correction process might be performed using regular expressions, in order to capture specific VNeo tokens.[10]
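A minimal sketch of the kind of regular-expression correction we have in mind (see also footnote 10); the pattern below is only an illustrative, untested rule for googlare, not part of the actual correction procedure:

    import re

    # Hypothetical correction rule for a single VNeo: map inflected forms of
    # googlare (googlavo, googlavi, googlando, ...) back to the lemma googlare.
    GOOGLARE = re.compile(
        r"\bgoogl(?:o|i|a|iamo|ate|ano|av\w*|er\w*|ando|ato|are)\b",
        re.IGNORECASE,
    )

    def correct_lemmas(line):
        """Replace inflected forms of googlare with its lemma in one corpus line."""
        return GOOGLARE.sub("googlare", line)

    print(correct_lemmas("ieri googlavo il nome del ristorante"))
    # -> "ieri googlare il nome del ristorante"

In practice such rules would be applied to the lemma column of the tagged corpus, one rule set per problematic VNeo.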
3.2 Building the base space

Starting from the corpus, the base DSM is built based on observations of the 200,000 most frequent words (defined as the vocabulary) and their contextual information, considering a co-occurrence window of 5 words on either side of a target word. For the purposes of this study, we consider the VNeos included in the vocabulary. In this way we obtain the base space.

In order to project a subspace contextualised by a VNeo, we consider the co-occurrence features with the highest mutual information statistics associated with that particular VNeo. So, for instance, we find the following salient features:

customizzare 'to customise' [city; modellazione; illustrato; type; batch; editare; nastro; segmentare; preferenza; iconico; ...]
resettare 'to reset' [reset; password; formattare; bios; clempad; clementoni; fonera; resettare; centralina; router; ...]
googlare 'to google' [telespettatore; pdf; tecnologia; informazione; addirittura; vi; chiave; invito; risposta; sapere; ...]

These features are associated with the maximum mutual information values in terms of their co-occurrence with each of the corresponding input neologisms.

Some other VNeos represented in the vocabulary are: postare 'to post', taggare 'to tag', twittare 'to tweet', spammare 'to spam', attenzionare 'to warn', spoilerare 'to share information that reveals the plot of a book or film', bloggare 'to blog', loggare 'to log', switchare 'to switch'.
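To make the procedure concrete, here is a minimal sketch of PMI-based feature selection over a toy co-occurrence table (the counts and the number of selected dimensions are invented; the actual model uses the 200,000-word vocabulary and the large co-occurrence counts described above): the features with the highest PMI for a given VNeo are kept as the dimensions of its contextualised subspace, and every word-vector is then restricted to those dimensions.

    import math
    from collections import defaultdict

    # Toy co-occurrence counts: cooc[word][feature] = joint count.
    # In the real model these come from the corpus sample described in Section 3.
    cooc = {
        "googlare": {"nome": 8,  "indirizzo": 5,  "sito": 7,  "pranzo": 1},
        "cercare":  {"nome": 40, "indirizzo": 22, "sito": 30, "pranzo": 15},
        "mangiare": {"nome": 2,  "indirizzo": 1,  "sito": 1,  "pranzo": 60},
    }

    total = sum(c for feats in cooc.values() for c in feats.values())
    word_count = {w: sum(f.values()) for w, f in cooc.items()}
    feat_count = defaultdict(int)
    for feats in cooc.values():
        for f, c in feats.items():
            feat_count[f] += c

    def pmi(word, feat):
        p_joint = cooc[word].get(feat, 0) / total
        if p_joint == 0:
            return float("-inf")
        return math.log2(p_joint / ((word_count[word] / total) * (feat_count[feat] / total)))

    K = 3  # number of salient dimensions kept for the subspace (invented)
    salient = sorted(cooc["googlare"], key=lambda f: pmi("googlare", f), reverse=True)[:K]
    print("salient features for googlare:", salient)

    # Project every word onto the contextualised subspace: keep only the
    # (non-negative) PMI values along the selected dimensions.
    subspace = {w: [max(pmi(w, f), 0.0) for f in salient] for w in cooc}
    print(subspace)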
3.3 itTenTen16 corpus

We conduct our analysis on the itTenTen16 corpus (Jakubíček et al., 2013) because it is the most up-to-date corpus available for Italian. It is also a web-based corpus, and so particularly well suited to examining neologisms: the web and IT domain is in fact a notable source of new words and, especially, of new loanwords. As the corpus is sizeable (4.9 billion tokens), we will use a random sample of the full corpus for purposes of computability. This sample will correspond to one fifth of the original corpus.

4 Interpreting VNeo using geometrical subspaces

As referenced in Section 1, our goal is to verify whether the meaning of a neologism can be induced from its context through distributional techniques, in particular by discovering verbs with salient geometric features in a contextualised subspace. To this end, we organise the task as follows. Starting from a subset of the most frequent VNeos found in the corpus (Section 3), we first build subspaces for VNeos using the DSM presented in Section 2. Subspaces are created by selecting the sets of dimensions that are conceptually salient to the context being modelled: each dimension in a subspace corresponds to a specific co-occurrence feature (i.e. a word). By finding a whole set of co-occurrences and using these to generate a relatively high-dimensional projection, we hope to establish a general contextualised conceptual profile and to overcome the peculiarities associated with low-frequency targets. For example, if the model finds that googlare 'to google' co-occurs with words like nome 'name', indirizzo 'address', and sito 'website', we use those co-occurrences as the basis for the projection of a subspace in which one could predict to find terms like cercare 'to search' using geometric techniques.

Context can be defined in an open-ended way in these models. For instance, the salient co-occurrence features of a single word can be used to generate a subspace. Small sets of words, either components of observed compositions (McGregor et al., 2017) or groups of conceptually related terms (McGregor et al., 2015), have also been used to generate semantically productive subspaces. In the small example illustrated in Figure 1, on the other hand, dimensions are defined explicitly in terms of the salient words associated with a small number of very recent observations of two different neologisms in use, specifically extrapolated from the salient co-occurrence features of Twitter posts in which the targeted neologisms are mentioned.

[Figure 1: Two subspaces projected based on two co-occurrence dimensions closely associated with the words (a) vaped and vaping, and (b) trolled and trolling, as observed in a small set of recent posts on Twitter. Among vectors for a number of candidate interpretations of neologisms, we see appropriate interpretations emerging based on distance from the origin in each contextualised subspace, based on PMI statistics extrapolated from co-occurrences observed across English language Wikipedia.]

Contextualised subspaces can be explored in terms of the geometric features of the word-vectors projected into them. So, for instance, McGregor et al. (2015) propose a norm method, by which word-vectors salient in a particular context will emerge as being far from the origin. This phenomenon is observed, with appropriate interpretations percolating into the salient regions, even in the low-dimensional toy examples illustrated in Figure 1, which involve a dynamically contextual DSM built from English language Wikipedia. Choices about context selection techniques, the geometric characteristics of the subspaces to be explored, and modelling parameters including the dimensionality of projections will be the subject of our forthcoming experiments.

In order to evaluate the model, we will compare our results against those obtained by applying the Word2Vec model to the same corpus (Mikolov et al., 2013a).

With further investigations we will also test this model using a gold standard consisting of human judgments on VNeo interpretations collected for this purpose. Similarity judgments will be provided by two native speakers with a significant background in linguistics. Specifically, the dataset will consist of verb pairs in which VNeo are grouped with more common verbs (e.g. googlare and cercare) based on human ratings collected in the form of a TOEFL-like multiple-choice synonymy test.[11]
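As a toy illustration of the norm method and of the multiple-choice set-up just described (the coordinates and the candidate set are invented; they do not come from the actual subspaces), the sketch below picks, among four candidate verbs, the one whose vector lies furthest from the origin in a subspace contextualised by the VNeo:

    from math import sqrt

    # Invented coordinates of candidate verbs in a 2-dimensional subspace
    # projected for the VNeo googlare (cf. Figure 1 for analogous English cases).
    candidates = {
        "cercare":   (0.92, 0.80),   # 'to search': the expected interpretation
        "mangiare":  (0.10, 0.05),   # 'to eat'
        "dormire":   (0.08, 0.12),   # 'to sleep'
        "camminare": (0.15, 0.07),   # 'to walk'
    }

    def norm(vec):
        return sqrt(sum(x * x for x in vec))

    for verb, vec in sorted(candidates.items(), key=lambda kv: -norm(kv[1])):
        print(f"{verb:10s} norm = {norm(vec):.3f}")

    best = max(candidates, key=lambda verb: norm(candidates[verb]))
    print("predicted interpretation:", best)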
5 Conclusion

The aim of the task presented here is to investigate the importance of linguistic context for the interpretation of neologisms, grounding the analysis in a context-sensitive DSM. With this task we intend to tackle issues connected with creativity processes and with the environmental (contextual) sensitivity typical of human cognition. In addition, we apply this DSM to Italian for the first time, providing a new resource for the analysis of the language. Further studies may compare our results with other DSMs, and/or investigate what the semantic relations found with this specific approach reveal about phenomena belonging to other linguistic levels (e.g. syntax).

Footnotes

[1] This is not the case for neologisms created for advertising, brand names or marketing purposes in general (Lehrer, 2003:380).
[2] All of these aspects are investigated, for example, in the field of Contextual Vocabulary Acquisition (Rapaport and Ehrlich, 2000).
[3] Cf. Blundell et al. (2017) and Chersoni et al. (2016).
[4] http://www.treccani.it/magazine/lingua_italiana/neologismi (last consulted 10/04/2018)
[5] http://www.iliesi.cnr.it/ONLI/BD.php (last consulted 02/05/2018)
[6] http://www.accademiadellacrusca.it/it/lingua-italiana/parole-nuove (last consulted 02/05/2018)
[7] We are aware that this might correspond to the loss of some other neologisms contained in the corpus.
[8] Neologisms are not stored in common word-lists, and they are (usually) rare words, thus presenting difficulties for machine learning techniques.
[9] The lemmatisation is obtained using the TreeTagger tool (Schmid, 1994) with Baroni's parameter file (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/).
[10] Regular expressions might be useful, within the corpus, to find an inflected form of a verb (lemmatised as it is) and replace it with the correct lemma: e.g. find the lemma googlav. (i.e. googlavo, googlavi, etc.) and replace it with googlare.
[11] Here the task is to determine, for a number of target words, the closest synonym from a choice of four alternatives.

References

Giovanni Adamo and Valeria Della Valle. 2017. Che cos'è un neologismo?. Carocci Editore, Roma.

Kat Agres, Stephen McGregor, Karolina Rataj, Matthew Purver, and Geraint A. Wiggins. 2016. Modeling metaphor perception with distributional semantics vector space models. In Workshop on Computational Creativity, Concept Invention, and General Intelligence, 08/2016.

Matteo Amore. 2017. I Verbi Neologici nell'Italiano del Web: Comportamento Sintattico e Selezione dell'Ausiliare. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy, December 11-13, 2017.

Lawrence W. Barsalou. 1993. Flexibility, structure, and linguistic vagary in concepts: Manifestations of a compositional system of perceptual symbols. In A.C. Collins, S.E. Gathercole, and M.A. Conway, editors, Theories of Memory, pages 29–101. Lawrence Erlbaum Associates, London.

Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.

Benjamin Blundell, Mehrnoosh Sadrzadeh, and Elisabetta Jezek. 2017. Experimental results on exploiting predicate-argument structure for verb similarity in distributional semantics. In Clasp Papers in Computational Linguistics, vol. 1, pages 99–106.

Emmanuele Chersoni, Enrico Santus, Alessandro Lenci, Philippe Blache, and Chu-Ren Huang. 2016. Representing verbs with rich contexts: an evaluation on verb similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pages 1967–1972.

Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. 2011. Mathematical foundations for a compositional distributional model of meaning. Linguistic Analysis, 36:345–384.

Joaquín Derrac and Steven Schockaert. 2015. Inducing semantic relations from conceptual spaces: A data-driven approach to plausible reasoning. Artificial Intelligence, 228:66–94.

E. Darío Gutiérrez, Ekaterina Shutova, Tyler Marghetis, and Benjamin K. Bergen. 2016. Literal and metaphorical senses in compositional distributional semantic models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.
Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, and Vít Suchomel. 2013. The TenTen corpus family. In 7th International Corpus Linguistics Conference CL, pages 125–127.

Adrienne Lehrer. 2003. Understanding trendy neologisms. Italian Journal of Linguistics, 15:369–382.

Stephen McGregor, Kat Agres, Matthew Purver, and Geraint Wiggins. 2015. From distributional semantics to conceptual spaces: A novel computational method for concept creation. Journal of Artificial General Intelligence, 6(1):55–89.

Stephen McGregor, Matthew Purver, and Geraint Wiggins. 2016. Words, concepts, and the geometry of analogy. In Proceedings of the Workshop on Semantic Spaces at the Intersection of NLP, Physics and Cognitive Science (SLPCS), pages 39–48.

Stephen McGregor, Elisabetta Jezek, Matthew Purver, and Geraint Wiggins. 2017. A geometric method for detecting semantic coercion. In Proceedings of the 12th International Workshop on Computational Semantics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In ICLR Workshop Papers.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34:1388–1429.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT 2018, pages 2227–2237.

William J. Rapaport and Karen Ehrlich. 2000. A computational theory of vocabulary acquisition. In Stuart Charles Shapiro and Lucja M. Iwańska, editors, Natural Language Processing and Knowledge Representation: Language for Knowledge and Knowledge for Language. MIT Press, Cambridge, MA.

Magnus Sahlgren. 2008. The distributional hypothesis. Italian Journal of Linguistics, 20(1):33–54.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.