Distributional Analysis of Verbal Neologisms: Task Definition and Dataset Construction

Matteo Amore
University of Pavia / Pavia, Italy
CELI Language Technology / Turin, Italy
matteo.amore01@universitadipavia.it

Stephen McGregor
LATTICE - CNRS & École normale supérieure / Montrouge, France
Université Sorbonne nouvelle Paris 3 / Paris, France
semcgregor@hotmail.com

Elisabetta Jezek
University of Pavia, Department of Humanities / Pavia, Italy
jezek@unipv.it

Abstract

English. In this paper we introduce the task of interpreting verbal neologisms (VNeo) for the Italian language, making use of a highly context-sensitive distributional semantic model (DSM). The task is commonly performed manually by lexicographers, who verify the contexts in which the VNeo appear. Developing such a task is likely to be of use from a cognitive, social and linguistic perspective. In the following, we first outline the motivation for our study and our goal, then focus on the construction of the dataset and the definition of the task.

Italiano. In this contribution we introduce a task of interpreting verbal neologisms (VNeo) in Italian, using a highly context-sensitive distributional semantic model. This activity is commonly carried out manually by lexicographers, who check the contexts in which the VNeo appears. Developing this kind of task can prove useful from a linguistic, cognitive and social perspective. In what follows we first present the motivation and goals of the analysis, and then concentrate on the construction of the dataset and on the definition of the task.

1 Introduction: motivation and goals

Studying neologisms can tell us several things. From a lexicographic point of view, neologisms can show trends that a language is following. In our opinion, they can also shed light on various aspects of linguistic creativity: when speakers use new words (coined by themselves, or recently coined by someone else), they expect that the hearer will understand what they have just said.[1] Reversing the perspective, from the point of view of the hearers, when they encounter a word for the first time they are generally capable of making hypotheses about its meaning. The process of understanding unknown words involves the employment of previously acquired information. This knowledge can come from various sources: experience of the world, education, and contextual elements;[2] in this contribution we focus on linguistic contextual information, namely co-occurrence.

For computational linguistics, neologisms raise some intriguing issues: automatic detection (especially for languages which do not separate written words with blank spaces); lemmatisation; POS tagging; semantic analysis; and so forth.

In this paper we present the task we have developed in order to interpret neologisms, using the context-sensitive DSM described by McGregor et al. (2015). This model was built to represent concepts in a spatial configuration, making use of a computational technique that creates conceptual subspaces. With the help of this DSM we intend to analyse the behaviour of a sub-group of neologisms, namely verbal neologisms (see Amore, 2017 for more background).

Our goal is primarily linguistic. We intend to investigate the interpretation of VNeo, measuring the semantic salience of candidate synonyms by way of geometries indicated by an analysis of co-occurrence observations of VNeos. For instance, we expect that the VNeo googlare 'to google' and a verb like cercare 'to search' are geometrically related in a subspace specific to the conceptual context of the neologism.
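As a minimal sketch of the kind of geometric relatedness we have in mind (the toy vectors below are invented and do not come from any of the models described in the following sections), one can check whether a vector for googlare lies closer, under cosine similarity, to cercare 'to search' than to an unrelated verb such as mangiare 'to eat':

    from math import sqrt

    # Invented toy vectors standing in for rows of a co-occurrence model;
    # the actual representations are described in the following sections.
    vectors = {
        "googlare": [0.9, 0.1, 0.7, 0.0],
        "cercare":  [0.8, 0.2, 0.6, 0.1],
        "mangiare": [0.0, 0.9, 0.1, 0.8],
    }

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

    for candidate in ("cercare", "mangiare"):
        sim = cosine(vectors["googlare"], vectors[candidate])
        print(f"googlare ~ {candidate}: {sim:.3f}")
    # The expectation stated above is that the first value is clearly higher.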
The interpretation of neologisms presents two main challenges: a) analysing verbs using vectors built only upon co-occurrences (thus excluding argument structures) is notoriously difficult for DSMs;[3] b) neologisms are, by definition, words whose frequency is (very) low, because their use is (still) not widespread. This is a challenge for distributional models precisely because the vectors for most VNeo will rely upon few occurrences. In order to evaluate our results, we will compare them with those obtained using the Word2Vec model (Mikolov et al., 2013a), and with a gold standard consisting of human judgments on semantic relatedness (synonymy). The paper is structured as follows. In Section 2 we introduce the DSM that we employ in our task; in Section 3 we describe the construction of the VNeo dataset and the problems we encountered. Finally, in Section 4 we outline the task and present some preliminary thoughts on expected results.

2 Distributional Semantic Modelling

DSM is a technique for building up measurable, computationally tractable lexical semantic representations based on observations of the way that words co-occur with one another across large-scale corpora. This methodology is grounded in the distributional hypothesis, which maintains that words observed to have similar co-occurrence profiles are likely to be semantically related (Harris, 1954; Sahlgren, 2008).
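As a minimal illustration of the kind of co-occurrence observation this hypothesis relies on (a sketch only, not the implementation underlying the model described below), the snippet counts, for each lemma in a toy lemmatised corpus, the words appearing within a symmetric window; such count profiles are the raw material from which distributional representations are built:

    from collections import Counter, defaultdict

    # Toy lemmatised "corpus": one sentence per list; a real corpus would be
    # streamed from disk (cf. the itTenTen16 sample described in Section 3).
    corpus = [
        ["io", "googlare", "il", "nome", "di", "il", "ristorante"],
        ["lui", "cercare", "il", "nome", "di", "il", "sito"],
    ]

    WINDOW = 5  # symmetric window of 5 words on either side, as in Section 3.2

    cooc = defaultdict(Counter)
    for sentence in corpus:
        for i, target in enumerate(sentence):
            lo, hi = max(0, i - WINDOW), min(len(sentence), i + WINDOW + 1)
            for j in range(lo, hi):
                if j != i:
                    cooc[target][sentence[j]] += 1

    print(cooc["googlare"])   # co-occurrence profile of the target lemma
    print(cooc["cercare"])    # a distributionally similar verb shares much of it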
In general, a DSM consists of a high-dimensional vector space in which words correspond to vectors, and the geometric relationship between vectors is expected to indicate something about the semantic relationship between the associated words. The relationship most typically modelled is general semantic relatedness, as opposed to more precise indications of, for instance, similarity (Hill et al., 2015), but distributional semantic models have been effectively applied to tasks ranging from language modelling (Bengio, 2009) to metaphor classification (Gutiérrez et al., 2016) and the extrapolation of more fine-grained intensional correspondences between concepts (Derrac and Schockaert, 2015).

Standard DSM techniques present two problems for the task of interpreting neologisms. First, distributional representations are predicated on many observations of a word across a large-scale corpus: it is the plurality of contexts which gives these representations their semantic nuance. Second, the spaces generated by standard approaches such as matrix factorisation and neural networks are abstract, in the sense that their dimensions are not interpretable; as such, typical distributional semantic models are not sensitive to the context-specific way in which meaning arises in the course of language use. McGregor et al. (2015) have proposed a context-sensitive approach to distributional semantic modelling that seeks to overcome this second problem by using contextual information to project semantic representations into lower-dimensional conceptual perspectives in an on-line way.

This methodology entails the selection of sets of dimensions, from a base space of co-occurrence statistics, that are in some sense conceptually salient to the context being modelled. The selection of salient features facilitates the projection of subspaces in which the geometric situation of, and relationships between, word-vectors are expected to map to a specific conceptual context. This technique has been applied to tasks involving context-sensitive semantic phenomena such as metaphor rating (Agres et al., 2016), analogy completion (McGregor et al., 2016), and the classification of semantic type coercion (McGregor et al., 2017).

With regard to the first problem, data sparsity, we propose that the facility of the dynamically contextual approach for handling the ad hoc emergence of concepts (Barsalou, 1993) should provide a way of mapping from relatively few observations of neologisms, possibly taken outside the data used to build the underlying model, to context-specific perspectives on distributional semantic representations.

3 Verbal Neologisms: dataset, corpus and lemmatisation

We will now explain the methodology we use in our analysis and describe the resources we exploit, highlighting their main features.

3.1 Sources for the neologisms list

To select the VNeo to be analysed, we extract data from pre-existing lists of Italian neologisms. These lists come from three websites: a) treccani.it;[4] b) iliesi.cnr.it/ONLI/;[5] c) accademiadellacrusca.it.[6] (a) and (b) are manually compiled and validated: they contain words manually found in some widely read newspapers but not (yet) included in Italian dictionaries, coherently with the lexicographical definition of neologisms (cf. Adamo and Della Valle, 2017). (c) consists of a list of words that, according to the users of the website, should be included in dictionaries. There is no curating of these suggestions (except the removal of swearwords); thus some of these neologisms might already be included in dictionaries. We chose to use this list because it allows us to analyse words which are perceived as new by a community of Italian speakers. In this way we intend to highlight the perspective of the hearers encountering new words.

Within the lists, we select only the verbs, obtaining a set of 504 VNeo. Of these, we check their presence in the itTenTen16 corpus, which we will also use to create the distributional vector space. 340 VNeo are attested in the corpus: 108 have between 10 and 99 occurrences; 79 between 100 and 999 occurrences; and 26 have more than 1,000 occurrences.

Instead of using heuristic techniques that might have identified neologisms within the corpus (e.g. computing less frequent words and manually checking their presence in dictionaries),[7] we chose to rely on lists because we intend to study words whose use is wider and not restricted only to the web domain.

It is worth noting that we create vectors starting from lemmas (not tokens). Our analysis highlighted some inaccuracies in the automatic lemmatisation of neologisms,[8] which were already present in the original corpus.[9] In a future investigation we are planning to compare the results produced with the original lemmatised corpus against the results obtained from a corpus version in which the lemmatisation has been corrected. This correction process might be performed using regular expressions, in order to capture specific VNeo tokens.[10]
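A minimal sketch of the kind of regular-expression correction we have in mind (see also footnote 10); the pattern below is only an illustrative, untested rule for googlare, not part of the actual correction procedure:

    import re

    # Hypothetical correction rule for a single VNeo: map inflected forms of
    # googlare (googlavo, googlavi, googlando, ...) back to the lemma googlare.
    GOOGLARE = re.compile(
        r"\bgoogl(?:o|i|a|iamo|ate|ano|av\w*|er\w*|ando|ato|are)\b",
        re.IGNORECASE,
    )

    def correct_lemmas(line):
        """Replace inflected forms of googlare with its lemma in one corpus line."""
        return GOOGLARE.sub("googlare", line)

    print(correct_lemmas("ieri googlavo il nome del ristorante"))
    # -> "ieri googlare il nome del ristorante"

In practice such rules would be applied to the lemma column of the tagged corpus, one rule set per problematic VNeo.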
3.2 Building the base space

Starting from the corpus, the base DSM is built based on observations of the 200,000 most frequent words (defined as the vocabulary) and their contextual information, considering a co-occurrence window of 5 words on either side of a target word. For the purposes of this study, we consider the VNeos included in the vocabulary. In this way we obtain the base space.

In order to project a subspace contextualised by a VNeo, we consider the co-occurrence features with the highest mutual information statistics associated with that particular VNeo. So, for instance, we find the following salient features:

customizzare 'to customise' [city; modellazione; illustrato; type; batch; editare; nastro; segmentare; preferenza; iconico; ...]
resettare 'to reset' [reset; password; formattare; bios; clempad; clementoni; fonera; resettare; centralina; router; ...]
googlare 'to google' [telespettatore; pdf; tecnologia; informazione; addirittura; vi; chiave; invito; risposta; sapere; ...]

These features are associated with the maximum mutual information values in terms of their co-occurrence with each of the corresponding input neologisms.

Some other VNeos represented in the vocabulary are: postare 'to post', taggare 'to tag', twittare 'to tweet', spammare 'to spam', attenzionare 'to warn', spoilerare 'to share information that reveals the plot of a book or film', bloggare 'to blog', loggare 'to log', switchare 'to switch'.
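To make the procedure concrete, here is a minimal sketch of PMI-based feature selection over a toy co-occurrence table (the counts and the number of selected dimensions are invented; the actual model uses the 200,000-word vocabulary and the large co-occurrence counts described above): the features with the highest PMI for a given VNeo are kept as the dimensions of its contextualised subspace, and every word-vector is then restricted to those dimensions.

    import math
    from collections import defaultdict

    # Toy co-occurrence counts: cooc[word][feature] = joint count.
    # In the real model these come from the corpus sample described in Section 3.
    cooc = {
        "googlare": {"nome": 8,  "indirizzo": 5,  "sito": 7,  "pranzo": 1},
        "cercare":  {"nome": 40, "indirizzo": 22, "sito": 30, "pranzo": 15},
        "mangiare": {"nome": 2,  "indirizzo": 1,  "sito": 1,  "pranzo": 60},
    }

    total = sum(c for feats in cooc.values() for c in feats.values())
    word_count = {w: sum(f.values()) for w, f in cooc.items()}
    feat_count = defaultdict(int)
    for feats in cooc.values():
        for f, c in feats.items():
            feat_count[f] += c

    def pmi(word, feat):
        p_joint = cooc[word].get(feat, 0) / total
        if p_joint == 0:
            return float("-inf")
        return math.log2(p_joint / ((word_count[word] / total) * (feat_count[feat] / total)))

    K = 3  # number of salient dimensions kept for the subspace (invented)
    salient = sorted(cooc["googlare"], key=lambda f: pmi("googlare", f), reverse=True)[:K]
    print("salient features for googlare:", salient)

    # Project every word onto the contextualised subspace: keep only the
    # (non-negative) PMI values along the selected dimensions.
    subspace = {w: [max(pmi(w, f), 0.0) for f in salient] for w in cooc}
    print(subspace)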
3.3 itTenTen16 corpus

We conduct our analysis on the itTenTen16 corpus (Jakubíček et al., 2013) because it is the most up-to-date corpus available for Italian. It is also a web-based corpus, and so particularly well suited to examining neologisms: the web and IT domain is in fact a notable source of new words and, especially, of new loanwords. As the corpus is sizeable (4.9 billion tokens), we will use a random sample of the full corpus for purposes of computability. This sample will correspond to one fifth of the original corpus.

4 Interpreting VNeo using geometrical subspaces

As referenced in Section 1, our goal is to verify whether the meaning of a neologism can be induced from its context through distributional techniques, in particular by discovering verbs with salient geometric features in a contextualised subspace. To this end, we organise the task as follows. Starting from a subset of the most frequent VNeos found in the corpus (Section 3), we first build subspaces for VNeos using the DSM presented in Section 2. Subspaces are created by selecting the sets of dimensions that are conceptually salient to the context being modelled: each dimension in a subspace corresponds to a specific co-occurrence feature (i.e. a word). By finding a whole set of co-occurrences and using these to generate a relatively high-dimensional projection, we hope to establish a general contextualised conceptual profile and to overcome the peculiarities associated with low-frequency targets. For example, if the model finds that googlare 'to google' co-occurs with words like nome 'name', indirizzo 'address', and sito 'website', we use those co-occurrences as the basis for the projection of a subspace in which one could predict to find terms like cercare 'to search' using geometric techniques.

Context can be defined in an open-ended way in these models. For instance, the salient co-occurrence features of a single word can be used to generate a subspace. Small sets of words, either components of observed compositions (McGregor et al., 2017) or groups of conceptually related terms (McGregor et al., 2015), have also been used to generate semantically productive subspaces. In the small example illustrated in Figure 1, on the other hand, dimensions are defined explicitly in terms of the salient words associated with a small number of very recent observations of two different neologisms in use, specifically extrapolated from the salient co-occurrence features of Twitter posts in which the targeted neologisms are mentioned.

[Figure 1: Two subspaces projected based on two co-occurrence dimensions closely associated with the words (a) vaped and vaping, and (b) trolled and trolling, as observed in a small set of recent posts on Twitter. Among vectors for a number of candidate interpretations of neologisms, we see appropriate interpretations emerging based on distance from the origin in each contextualised subspace, based on PMI statistics extrapolated from co-occurrences observed across English language Wikipedia.]

Contextualised subspaces can be explored in terms of the geometric features of the word-vectors projected into them. So, for instance, McGregor et al. (2015) propose a norm method, by which word-vectors salient in a particular context will emerge as being far from the origin. This phenomenon is observed, with appropriate interpretations percolating into the salient regions, even in the low-dimensional toy examples illustrated in Figure 1, which involve a dynamically contextual DSM built from English language Wikipedia. Choices about context selection techniques, the geometric characteristics of the subspaces to be explored, and modelling parameters including the dimensionality of projections will be the subject of our forthcoming experiments.

In order to evaluate the model, we will compare our results against those obtained by applying the Word2Vec model to the same corpus (Mikolov et al., 2013a).

With further investigations we will also test this model using a gold standard consisting of human judgments on VNeo interpretations collected for this purpose. Similarity judgments will be provided by two native speakers with a significant background in linguistics. Specifically, the dataset will consist of verb pairs in which VNeo are grouped with more common verbs (e.g. googlare and cercare) based on human ratings collected in the form of a TOEFL-like multiple-choice synonymy test.[11]
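As a toy illustration of the norm method and of the multiple-choice set-up just described (the coordinates and the candidate set are invented; they do not come from the actual subspaces), the sketch below picks, among four candidate verbs, the one whose vector lies furthest from the origin in a subspace contextualised by the VNeo:

    from math import sqrt

    # Invented coordinates of candidate verbs in a 2-dimensional subspace
    # projected for the VNeo googlare (cf. Figure 1 for analogous English cases).
    candidates = {
        "cercare":   (0.92, 0.80),   # 'to search': the expected interpretation
        "mangiare":  (0.10, 0.05),   # 'to eat'
        "dormire":   (0.08, 0.12),   # 'to sleep'
        "camminare": (0.15, 0.07),   # 'to walk'
    }

    def norm(vec):
        return sqrt(sum(x * x for x in vec))

    for verb, vec in sorted(candidates.items(), key=lambda kv: -norm(kv[1])):
        print(f"{verb:10s} norm = {norm(vec):.3f}")

    best = max(candidates, key=lambda verb: norm(candidates[verb]))
    print("predicted interpretation:", best)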
5 Conclusion

The aim of the task presented here is to investigate the importance of linguistic context for the interpretation of neologisms, grounding the analysis in a context-sensitive DSM. With this task we intend to tackle issues connected with creativity processes and with the environmental (contextual) sensitivity typical of human cognition. In addition, we apply this DSM to Italian for the first time, providing a new resource for the analysis of the language. Further studies may compare our results with other DSMs, and/or investigate what the semantic relations found with this specific approach reveal about phenomena belonging to other linguistic levels (e.g. syntax).

Footnotes

[1] This is not the case for neologisms created for advertising, brand names or marketing purposes in general (Lehrer, 2003:380).
[2] All of these aspects are investigated, for example, in the field of Contextual Vocabulary Acquisition (Rapaport and Ehrlich, 2000).
[3] Cf. Blundell et al. (2017) and Chersoni et al. (2016).
[4] http://www.treccani.it/magazine/lingua_italiana/neologismi (last consulted 10/04/2018)
[5] http://www.iliesi.cnr.it/ONLI/BD.php (last consulted 02/05/2018)
[6] http://www.accademiadellacrusca.it/it/lingua-italiana/parole-nuove (last consulted 02/05/2018)
[7] We are aware that this might correspond to the loss of some other neologisms contained in the corpus.
[8] Neologisms are not stored in common word-lists, and they are (usually) rare words, thus presenting difficulties for machine learning techniques.
[9] The lemmatisation is obtained using the TreeTagger tool (Schmid, 1994) with Baroni's parameter file (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/).
[10] Regular expressions might be useful, within the corpus, to find an inflected form of a verb (lemmatised as it is) and replace it with the correct lemma: e.g. find the lemma googlav. (i.e. googlavo, googlavi, etc.) and replace it with googlare.
[11] Here the task is to determine, for a number of target words, the closest synonym from a choice of four alternatives.

References

Giovanni Adamo and Valeria Della Valle. 2017. Che cos'è un neologismo?. Carocci Editore, Roma.

Kat Agres, Stephen McGregor, Karolina Rataj, Matthew Purver, and Geraint A. Wiggins. 2016. Modeling metaphor perception with distributional semantics vector space models. In Workshop on Computational Creativity, Concept Invention, and General Intelligence, 08/2016.

Matteo Amore. 2017. I Verbi Neologici nell'Italiano del Web: Comportamento Sintattico e Selezione dell'Ausiliare. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy, December 11-13, 2017.

Lawrence W. Barsalou. 1993. Flexibility, structure, and linguistic vagary in concepts: Manifestations of a compositional system of perceptual symbols. In A.C. Collins, S.E. Gathercole, and M.A. Conway, editors, Theories of Memory, pages 29–101. Lawrence Erlbaum Associates, London.

Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.

Benjamin Blundell, Mehrnoosh Sadrzadeh, and Elisabetta Jezek. 2017. Experimental results on exploiting predicate-argument structure for verb similarity in distributional semantics. In Clasp Papers in Computational Linguistics, vol. 1, pages 99–106.

Emmanuele Chersoni, Enrico Santus, Alessandro Lenci, Philippe Blache, and Chu-Ren Huang. 2016. Representing verbs with rich contexts: an evaluation on verb similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pages 1967–1972.

Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. 2011. Mathematical foundations for a compositional distributional model of meaning. Linguistic Analysis, 36:345–384.

Joaquín Derrac and Steven Schockaert. 2015. Inducing semantic relations from conceptual spaces: A data-driven approach to plausible reasoning. Artificial Intelligence, 228:66–94.

E. Darío Gutiérrez, Ekaterina Shutova, Tyler Marghetis, and Benjamin K. Bergen. 2016. Literal and metaphorical senses in compositional distributional semantic models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.
Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, and Vít Suchomel. 2013. The TenTen corpus family. In 7th International Corpus Linguistics Conference CL, pages 125–127.

Adrienne Lehrer. 2003. Understanding trendy neologisms. Italian Journal of Linguistics, 15:369–382.

Stephen McGregor, Kat Agres, Matthew Purver, and Geraint Wiggins. 2015. From distributional semantics to conceptual spaces: A novel computational method for concept creation. Journal of Artificial General Intelligence, 6(1):55–89.

Stephen McGregor, Matthew Purver, and Geraint Wiggins. 2016. Words, concepts, and the geometry of analogy. In Proceedings of the Workshop on Semantic Spaces at the Intersection of NLP, Physics and Cognitive Science (SLPCS), pages 39–48.

Stephen McGregor, Elisabetta Jezek, Matthew Purver, and Geraint Wiggins. 2017. A geometric method for detecting semantic coercion. In Proceedings of the 12th International Workshop on Computational Semantics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In ICLR Workshop Papers.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34:1388–1429.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT 2018, pages 2227–2237.

William J. Rapaport and Karen Ehrlich. 2000. A computational theory of vocabulary acquisition. In Stuart Charles Shapiro and Lucja M. Iwańska, editors, Natural Language Processing and Knowledge Representation: Language for Knowledge and Knowledge for Language. MIT Press, Cambridge, MA.

Magnus Sahlgren. 2008. The distributional hypothesis. Italian Journal of Linguistics, 20(1):33–54.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.