<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>“Ti blocco perché sei un trollazzo”. Lexical Innovation in Contemporary Italian in a Large Twitter Corpus</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Brasolin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Greta H. Franzini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefania Spina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eurac Research (Institute for Applied Linguistics)</institution>
          ,
          <addr-line>Viale Druso 1, 39100 Bolzano BZ</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University for Foreigners of Perugia</institution>
          ,
          <addr-line>Piazza Fortebraccio 4, 06123 Perugia PG</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>93</fpage>
      <lpage>105</lpage>
      <abstract>
        <p>This study investigates emerging vocabulary in contemporary Italian in a corpus of 5.32 M timestamped and geotagged tweets extracted from the Italian timeline throughout 2022. We automatically identify and manually distill 8 133 candidate neologisms down to 346 unattested word forms, shedding light on their spatio-temporal circulation patterns.</p>
      </abstract>
      <kwd-group>
        <kwd>twitter</kwd>
        <kwd>social media</kwd>
        <kwd>corpora</kwd>
        <kwd>italian</kwd>
        <kwd>lexical innovation</kwd>
        <kwd>language change</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Lexical innovation is one of the driving mechanisms of language change [1, 2]: through the creation of new words (in this paper, “word” and “form” are used interchangeably) and their integration into existing lexical systems [3], languages evolve and adapt to new social and technological contexts, which are constantly and rapidly changing. The process of creating new words can be approached from different standpoints. Firstly, the choice of sources necessary to trace the process of lexical innovation has great methodological relevance. One of the main traditional sources has been newspaper texts, which have the double benefit of being easily available and quantitatively relevant [4]. Secondly, lexical innovation follows different steps and usually develops from the initial emergence of new words in specific contexts to their proliferation to wider contexts and domains. This process may end with the institutionalisation of new word forms [5, 6] through their inclusion in dictionaries and consolidation in standard use. Thirdly, the linguistic processes leading to the creation of new words can be different and can include phenomena of derivation, composition, transcategorisation, creation of portmanteau forms, semantic shifts, and borrowing from other languages.</p>
      <p>The aim of this study is twofold. On the one hand, we present an analysis of emerging vocabulary in contemporary Italian stemming from Twitter interactions, using the 2022 Italian timeline as a source; social media represents an opportunity to analyse new word forms surfacing in everyday conversation, and provides vast amounts of data produced in real time by a large, heterogeneous and representative sample of speakers. Furthermore, the availability of geotagged texts enables the investigation of possible patterns of lexical innovation related to specific geographical areas [7]. This possibility is particularly promising in languages, like Italian, characterised by deep and articulated geographical variation. On the other hand, we propose a novel methodology to process and filter word forms acquired from a sizeable Twitter corpus, with the aim of detecting those that represent the best candidates to become new words.</p>
      <p>The result of the study is a list of 346 word forms, classified into 15 categories based on the linguistic process of lexical creation and yet unattested in two of the most up-to-date Italian lexicographic resources.</p>
      <sec id="sec-1-2">
        <title>2. Related Work</title>
        <p>Studies on lexical innovation in Italian have a long tradition [8], and have produced extensive lexicographic works dedicated to neologisms (e.g., [9], to mention one of the most recent), as well as a vast body of research (e.g., [10], [11] and [12]). One of the most widely discussed topics is the classification of the linguistic processes leading to the creation and spread of new words.</p>
        <p>Traditionally, it is acknowledged that the means by which languages enrich their vocabulary are essentially four: the acquisition of new elements from other languages, the formation of new words from pre-existing lexical elements, the change of grammatical category, and the shift in the meaning of words already in use [13]. In the last few decades, the Osservatorio neologico della lingua italiana (ONLI, https://www.iliesi.cnr.it/ONLI/intro.php) [4] has been tracking new words emerging in Italian newspapers, producing a database which, to date, includes 2 986 forms with definition, date of attestation and first retrieved occurrence in the press.</p>
        <p>CLiC-it 2023: 9th Italian Conference on Computational Linguistics, November 30 to December 02, 2023, Venice, Italy. * Corresponding author: paolo.brasolin@eurac.edu (P. Brasolin); greta.franzini@eurac.edu (G. H. Franzini); stefania.spina@unistrapg.it (S. Spina). https://paolobrasolin.github.io/ (P. Brasolin). ORCID: 0000-0003-2471-7797 (P. Brasolin), 0000-0003-1159-5575 (G. H. Franzini), 0000-0002-9957-3903 (S. Spina). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      </sec>
      <sec id="sec-1-3">
        <p>More recently, several studies have highlighted the
benefits of using social media to track new word forms
cropping up in informal contexts, such as everyday
conversation, as opposed to newspaper texts, which are more
formal and draw from different registers [14, 15, 16].
Additionally, as a populous repository of conversations held
in real time by a large number of speakers, social media
can capture lexical creativity originating in communities
of people rather than inventive journalism [17]. This
use of social media has produced a number of studies
[18, 7, 19] focussed on the initial and less documented
phase of the lexical innovation process, right after the
words’ creation and first use, and well before their final
institutionalisation and inclusion in dictionaries [5, 6].</p>
        <p>It is well-known that only a small portion of the words
coined in everyday language use become new entries in
dictionaries and thus part of the vocabulary: many
remain ephemeral but are nevertheless compelling, as they
provide evidence of the linguistic mechanisms driving
the lexical innovation process. Generally, social media
allow researchers to extract and use an unprecedented
amount of conversational data [20, 21], which can
provide reliable computations of lexical innovation and thus
give a significant boost to the study of language variation
and change [22, 23].</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Corpus</title>
      <p>In order to investigate emerging vocabulary in contemporary Italian, we used a corpus of timestamped and geotagged tweets extracted from the Italian Twitter timeline throughout 2022. The corpus comprises 5.32 M tweets written by 153 k unique users, amounting to 71.5 M tokens (or 564 M characters).</p>
      <p>To the best of our knowledge, this is the first and largest study yet to address lexical innovation in Italian Twitter.</p>
      <p>Regrettably, this could also be the last. The recent takeover of Twitter collapsed its value for academia: as of summer 2023, publicly accessible data has been severely restricted, API prices have sharply risen, and academic access has been cancelled outright.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <p>Manual annotation aside, all our procedures are implemented as code and organised into a series of modular stages. To facilitate operation, they are accompanied and coordinated by an executable dependency tree specifying the relations between them, their inputs and their outputs. Together, they constitute a cohesive and reproducible data pipeline.</p>
      <p>We exclusively used Open Source Software, mostly in the form of well-known Python packages and GNU tools (https://www.gnu.org/). An exhaustive list including version numbers can be found in Appendix A.</p>
      <p>In the following, we only discuss the general implementation design. The full source code is documented and available in [24].</p>
      <sec id="sec-3-1">
        <title>4.1. Acquisition</title>
        <p>Our corpus samples the Italian Twitter timeline of 2022. We define this notion as the conjunction of the conditions listed in Table 1, expressed using Twitter’s advanced search query language (extensive unofficial documentation is available at https://github.com/igorbrigadir/twitter-advanced-search/; the user interface is found at https://www.twitter.com/search-advanced). Thus, our corpus is a subset of the results given by the search combining the aforementioned conditions at the time of sampling.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Preparation</title>
        <sec id="sec-3-2-1">
          <title>4.2.1. Geographic Data</title>
          <p>Tweets can bear geolocation data in two independent forms: a latitude/longitude pair and an association with a place. A place is an administrative division or a point of interest, and it is characterised by an id, a country code, a geographical bounding box and other metadata. In our corpus, 99.43 % of tweets bear a place, 0.04 % only bear a lat./long. pair, and 0.53 % bear neither (this is possible because Twitter data can be redacted). Consequently, despite lat./long. pairs being more precise, we chose to deal with places only, as they cover the vast majority of tweets and already include the country code necessary to restrict the data exactly to Italy.</p>
          <p>We extracted 34.8 k unique places, keeping their id and country code (47.0 % are IT), and computed the centroid of their bounding box as a reference point for geographical calculations. 91.77 % of tweets refer to places with the IT country code; we assigned these to Italian regions by matching their centroids with government data on administrative boundaries (official ISTAT data is archived at https://www.istat.it/it/archivio/222527; we used the community-maintained GeoJSON version available at https://github.com/openpolis/geojson-italy/tree/2023.1) in order to plot choropleth maps of Italy. Of the remaining tweets, 8.16 % refer to places with other country codes and 0.07 % refer to a generic place representing the entirety of Italy: the occurrences of candidate forms from these two categories are included in the choropleth maps under a legend titled “Not shown”.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>4.2.2. Textual Data</title>
          <p>Tweets are rich structures. They include an id, a user id, a timestamp, the full text, the geolocation data discussed above, a list of entities and other metadata. An entity is a character range in the full text labelled by a type (either url, user mention, hashtag, symbol or media) and other metadata.</p>
          <p>First, we extracted all full texts into a flat data file to be loaded into AntConc [25] as an aid to the downstream manual annotation process.</p>
          <p>Then, realising the entity metadata could greatly support the tokeniser at a later stage, we inlined them into the full text as delimiter markers, picking a different pair for every entity type from a set of reserved Unicode code points (we picked from the Private Use Area in the Basic Multilingual Plane, a set of code points left undefined by The Unicode Consortium [26, chapter 23.5] and reserved for special custom usage). Figure 1 illustrates how the procedure is carried out for hashtag entities: the range of the hashtag entity in “Hi #twitter !” is wrapped with the reserved delimiters U+E000 and U+E001.</p>
          <p>Finally, we extracted 5.32 M tweets, keeping their id, user id, timestamp, full text with inlined entities, and place id.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Cleanup and Tokenisation</title>
        <p>We used the spaCy v3.6.1 Italian tokeniser. However, tweets are challenging for a stock tokeniser and some issues need to be addressed.</p>
        <p>The first problem is the extensive use of Unicode (especially emojis), along with liberal usage of casing and whitespace. This can be easily addressed: we replaced all emojis with spaces, lowercased the whole text, and replaced any streaks of whitespace with a single space.</p>
        <p>The second, trickier, problem is the liberal usage of punctuation marks. Solving this required extending the tokeniser’s default infix matcher to also match any sequence of these commonly abused punctuation marks: ?!;:,."()[]{}</p>
        <p>The third and last problem is the presence of entities (urls, hashtags, etc.). This is where our previously inlined entity annotations came into play, quickly enabling us to make the tokeniser aware of them as follows:
• wrap all delimited regions in the text with spaces to nudge the tokeniser into correctly detecting their beginning,
• define a custom token matcher detecting any sequence whose extrema are our delimiter character pairs, and
• disable the tokeniser’s default url matcher to avoid conflicts with our custom matcher.</p>
        <p>The stratagems above allowed us to execute the tokeniser producing a negligible amount of spurious tokens. We then filtered its output, discarding tokens that were pure whitespace, pure punctuation, pure numbers, broken and/or non-existent handles (i.e., tokens beginning with @ but not marked as entities), and all entities except hashtags. Processing all tweets as described, we extracted 71.5 M tokens, with 926 k types.</p>
      </sec>
      <sec id="sec-3-4">
        <title>4.4. Candidate Selection</title>
        <p>To select the candidates for annotation we applied two separate strategies, producing two subsets 𝒜 and ℬ with a slight overlap, as detailed in Table 2. 𝒜 derives from an established method in literature, and ℬ from our attempt to reach for a more interpretable and computationally lighter alternative. We now describe them both in detail.</p>
        <sec id="sec-3-4-1">
          <title>4.4.1. Subset 𝒜: Spearman’s ρ</title>
          <p>The first strategy follows in the steps of previous studies [18, 7] and amounts to calculating a measure of how monotonically the usage of a token increases in time, in order to reject tokens below a fixed threshold. The chosen measure of monotonicity is the Spearman rank correlation coefficient between the daily occurrences of a token (normalised by daily total token count) and the day number; we denote it with ρ. The choice of threshold is arbitrary: while the cited studies operated on multi-billion tweet corpora picking very restrictive thresholds at 0.7 and 0.8, our corpus is much smaller, so we can afford to lower the threshold until the size of the produced subset is still comfortable to annotate. We picked ρ &gt; 0.2, selecting a subset of 4 090 candidates.</p>
          <p>However, setting a positive lower bound to ρ penalises usage patterns we consider plausible for an emerging form (e.g., a sharp rise before midyear followed by a slow descent to a stable non-zero plateau). Therefore, we chose to extend the criteria to |ρ| &gt; 0.2, selecting 2 336 additional candidates. In other words, we are discarding the central values of ρ, where it is less predictive. Furthermore, we decided to perform the same calculation on the daily unique users of a token; we denote the result with ρ<sub>u</sub>. We allowed tokens with |ρ<sub>u</sub>| &gt; 0.2, selecting 311 additional candidates.</p>
          <p>Our decision to be so permissive, at the cost of extra annotation effort, was dictated by the intention to experimentally evaluate the effectiveness of the bounds over a wide range of threshold choices. Subset 𝒜 is thus defined by the combined condition max(|ρ|, |ρ<sub>u</sub>|) &gt; 0.2, selecting 6 737 candidates (0.73 % of the total).</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>4.5. Annotation</title>
        <p>The subset for annotation 𝒜 ∪ ℬ amounts to 26 890 candidates (2.90 % of the total extracted forms). To reduce the amount of handiwork, we used a lexicon of 514 k Italian forms specifically built for part-of-speech tagging tasks [27] to automatically tag already attested forms as uninteresting (including hashtags, to be analysed separately at a later stage), thus excluding 18 757 candidates. This left us with 8 133 candidate forms for manual annotation, which was performed in two stages by the second and third author of the present paper, trained as a classicist and a corpus linguist respectively. Firstly, we loaded the corpus into AntConc [25] to look up each form’s context (KWIC, KeyWord In Context format), while concurrently cross-checking two freely available online dictionaries (Garzanti at https://www.garzantilinguistica.it/ and Treccani at https://www.treccani.it/vocabolario/; the Slengo urban dictionary at https://slengo.it/ was also used for the occasional look-up of slang forms) and the ONLI neologisms database for attestation. As a result of this search, the annotators rated forms as either innovative or non-innovative. Inter-annotator disagreement was settled with a negotiating phase until agreement could be reached for all forms. Examples of discarded entries include forms attested in at least one of the consulted dictionaries; mistypes caused by key proximity; popular terms, e.g., bimbominchia; foreign words well attested in the media but not in dictionaries (yet), e.g., foliage, spending review, sponsorship; adapted loanwords, e.g., followo, crashare; infrequently used foreign words, e.g., smoothie, veggie, waffle; infrequently used foreign acronyms, e.g., PTSD; regionalisms and regional variants, e.g., annassero, ciolla, giargiana; gender-inclusive graphic variants, e.g., cittadin@; nicknames, e.g., pupone for footballer Francesco Totti; and the unfriendly portmanteau Cessica (cesso + Jessica).</p>
        <p>Next, and as shown in Table 3, we grouped innovative forms into one or more categories according to the ONLI typology scheme, with minor adaptations and integrations. Specifically, we only relied on categories referring to formal properties.</p>
      </sec>
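      <p>The entity inlining and tokenisation steps above can be illustrated with a self-contained sketch. Function names are illustrative, and a plain regular expression stands in for the spaCy-based pipeline; the actual implementation is in [24].</p>
      <preformat>
```python
import re

# Illustrative sketch of Sections 4.2.2 and 4.3: inline an entity's character
# range with a reserved Private Use Area delimiter pair, then tokenise while
# keeping delimited spans whole and collapsing runs of abused punctuation.
# This is a plain-regex stand-in, not the authors' spaCy-based code.
DELIMS = {"hashtag": ("\ue000", "\ue001")}  # one reserved pair per entity type

def inline_entities(text, entities):
    """entities: list of (start, end, type) character ranges, as in Twitter metadata."""
    # Work backwards so earlier offsets stay valid after each insertion.
    for start, end, etype in sorted(entities, reverse=True):
        opener, closer = DELIMS[etype]
        text = text[:start] + opener + text[start:end] + closer + text[end:]
    return text

# Delimited entity spans are matched first and therefore kept whole;
# runs of commonly abused punctuation become single tokens.
TOKEN = re.compile(r"[\ue000][^\ue001]*[\ue001]|\w+|[?!;:,.\"()\[\]{}]+")

def tokenise(text):
    return TOKEN.findall(text.lower())

print(tokenise(inline_entities("Hi #twitter !!!", [(3, 11, "hashtag")])))
```
      </preformat>
      <p>A real implementation would instead extend spaCy’s infix patterns and token matcher and disable its default url matcher, as described above.</p>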
    </sec>
    <sec id="sec-4">
      <title>4.4.2. Subset ℬ: An Alternative Approach</title>
      <p>ρ quantifies how much a form’s usage increases monotonically during the year. As previously mentioned, while this complex measure correlates with the behaviour of some emerging forms, it also excludes plausible usage patterns.</p>
      <p>We take the complementary approach and try instead to formulate simple criteria to exclude usage patterns that we would not expect from emerging forms:
• to reject accidental and sporadic phenomena (e.g., typos, inside jokes, etc.), we set a lower bound to the count of unique users u and occurrences n;
• to reject forms already in use from the past, we set a lower bound to the day of first occurrence A;
• to reject forms disappearing early, we set a large lower bound to the day of last occurrence Z;
• to reject ephemeral forms, we set a lower bound to the length of the usage lapse Z − A.</p>
      <p>We chose the following thresholds: u &gt; 9, n &gt; 9, A &gt; 7, Z &gt; 351 and Z − A &gt; 28. They read out as: we want forms that are used at least ten times by at least ten people, appear from the second week of January, do not disappear before mid December, and last more than four weeks. The specific values were tuned to cut off the markedly heavier tails from the distributions of the respective variables. This furthers the intention underlying our criteria to exclude the most common behaviours expected from non-emerging forms.</p>
      <p>Appendix D contains charts showing how 𝒜 and ℬ partition the dataset and comparing the effect of their defining criteria over the parameter space. Subset ℬ, defined by the conditions above, includes 21 132 candidates (2.28 % of the total).</p>
      <table-wrap id="tbl3">
        <label>Table 3</label>
        <caption>
          <p>Categories of the 346 innovative forms, with form counts and examples.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Category</th><th>Form count</th><th>Examples</th></tr>
          </thead>
          <tbody>
            <tr><td>orthographic variation</td><td>109</td><td>minkiate, rix, scienzah</td></tr>
            <tr><td>univerbation</td><td>48</td><td>lho, miraccomando</td></tr>
            <tr><td>suffixation</td><td>45</td><td>cinesata, sfanculamento</td></tr>
            <tr><td>loanword</td><td>40</td><td>fancam, scammer</td></tr>
            <tr><td>portmanteau</td><td>33</td><td>gintoxic, nazipass</td></tr>
            <tr><td>loanword adaptation</td><td>24</td><td>flexo, droppare</td></tr>
            <tr><td>alteration</td><td>17</td><td>fattoni</td></tr>
            <tr><td>prefixation</td><td>8</td><td>bidosati, pregirata</td></tr>
            <tr><td>acronym</td><td>6</td><td>lmv, sgp</td></tr>
            <tr><td>transcategorisation</td><td>6</td><td>cuora</td></tr>
            <tr><td>compounding</td><td>3</td><td>contapalle</td></tr>
            <tr><td>deonymic derivation</td><td>3</td><td>drum</td></tr>
            <tr><td>redefinition</td><td>2</td><td>maranza</td></tr>
            <tr><td>acronymic derivation</td><td>1</td><td>efeci</td></tr>
            <tr><td>tmesis</td><td>1</td><td>facenza</td></tr>
            <tr><td>Total form count</td><td>346</td><td/></tr>
          </tbody>
        </table>
      </table-wrap>
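      <p>The five thresholds above can be read as a single predicate over per-form statistics; a minimal sketch with an illustrative function name, not the pipeline code:</p>
      <preformat>
```python
# Toy sketch of the subset-B criteria from Section 4.4.2: keep a form when
# u and n exceed 9, first occurrence A falls after day 7, last occurrence Z
# falls after day 351, and the usage lapse Z - A exceeds 28 days.
from operator import gt  # gt(a, b) is the comparison a greater than b

def passes_subset_b(u, n, first_day, last_day):
    """u: unique users; n: occurrences; days are 1-based day numbers of 2022."""
    return (gt(u, 9) and gt(n, 9)
            and gt(first_day, 7)               # appears from the second week of January
            and gt(last_day, 351)              # survives until mid December
            and gt(last_day - first_day, 28))  # lasts more than four weeks

print(passes_subset_b(u=12, n=40, first_day=15, last_day=360))  # plausible emerging form
print(passes_subset_b(u=3, n=5, first_day=100, last_day=120))   # sporadic phenomenon
```
      </preformat>
      <p>Each bound is directly interpretable, which is the point of the approach: no correlation statistic has to be computed over the whole time series.</p>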
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <sec id="sec-5-0">
        <title>5.1. Emerging Forms</title>
        <p>We ignored the expressive emphasis category used in the ONLI: emphasis is very common in Twitter interactions [21] and falls under all other categories. In addition, we merged multiple ONLI categories into one: e.g., suffissazione, suffissoide, deverbale and denominale were merged into suffixation, while prefissazione and prefissoide were merged into prefixation. Finally, a new tmesis category was added to account for forms deriving from the splitting of compounds (e.g., facenza from nullafacenza). Appendix C provides the complete list, and a machine-readable dataset of annotated candidates is available in Franzini et al. [28].</p>
        <p>Overall, the 346 forms give insights into the most common means by which potential new words are created by Italian speakers. Some of these are those traditionally detected in neologism studies: the -ata (poverata), -ismo (cialtronismo) and -mento (sfanculamento) suffixes, for example, are among the most common morphological resources used to derive new words from existing ones [12].</p>
        <p>However, other forms seem particularly productive as potential sources of lexical innovation. Adapted loanwords, for instance, draw on the broad mechanism of inclusion of foreign verbs in the first conjugation in -are (droppare, followo, switchare), but also on less common phenomena, such as alteration through the suffixes -ino (trollini) or -azzo (trollazzo). Moreover, the widespread attitude towards evaluative language in social media interactions is witnessed by the presence of several emphatic and intensifying forms relying on different expressive means: in addition to the superlative suffix -issimo/a applied to verbs (adorissimo, riderissimo) or even employed as an autonomous word, particularly noteworthy is the use of augmentative suffixes like -one (personaggione, garone), univerbated forms (opperbacco, eddaiii, masticazzi), or portmanteaus such as nazipass and sinistronzi where emphasis blends with wordplay. Indeed, ironic and catchy wordplay frequently leads to lexical innovation and is typical of social media conversations.</p>
        <p>The most productive categories of lexical innovation in our corpus are:
• orthographic variation, often used either for emphasis (e.g., minkiate), to shorten existing words (e.g., rix for risposta), to conceal online conversation (also known as “leetspeak”, e.g., f4scist4), for fun (e.g., gomblotto) or for sarcasm (e.g., scienzah, with a final -h expressing scepticism towards scientific advances);
• univerbation, with forms such as miraccomando, lho or senzapalle;
• suffixation, featuring many forms ending in -ato/a (e.g., cinesata, quarantenato), -mento (e.g., sfanculamento) or with the intensifying -issimo/a applied to verbs (e.g., riderissimo) and to inherently intensified adjectives (e.g., incantevolissimissima from incantevole);
• (adapted) loanword, chiefly borrowed from English, with forms like flexo, loser and trollazzo;
• portmanteau, mostly relating to politics, with words such as cessodestra, sinistronzi and the amusing lettamaio (the combination of politicians Enrico Letta’s and Luigi Di Maio’s surnames reading as “pigsty”), but also gintoxic and maxipass.</p>
        <p>Overall, a non-negligible part of the detected innovative forms are tied to the online sphere, and, in specific cases, are not expected to be used in different contexts or to establish themselves as new Italian words (e.g., f4scista or mer*a, which are mainly used to conceal content). Nevertheless, their emerging use in Twitter interactions evidences the linguistic mechanisms underlying lexical innovation in Italian. For each form we produce a choropleth map showing its usage. Appendix E presents the maps of all emerging forms mentioned in the article, while Figure 2 illustrates four notable examples from different categories. The map of gomblotto shows that orthographic variation, when used for emphasis or ludic purposes, is widespread in almost all regions, though predominantly in Lombardy. Conversely, when orthographic variation is not primarily intended as a joke (e.g., poki or qndo), the spread of new forms is not as far-reaching. Similar considerations can be made for univerbated forms.</p>
      </sec>
      <sec id="sec-5-1">
        <p>Figure 2: choropleth maps of regional i.p.m. for four notable emerging forms: gomblotto, miraccomando, flexo and fattoni.</p>
        <p>Univerbated forms appear to be evenly, albeit thinly, spread out, with the occasional regional peak: miraccomando, for instance, is popular in Lombardy but less so in other regions. Other words reveal different patterns: the loanword flexo, for instance, meaning “to flaunt”, is mostly used in the western part of the country with little to no attestation in the lower eastern regions; fattoni, an alteration of “fatto” to denote unreliable individuals and junkies, appears to be in use in the northern regions of Lombardy and Veneto but not so in either the eastern part of the country or the islands. Although, intuitively, spatial variation in social media has different characteristics from traditional geographical variation in relation to language use, previous research has detected a broad alignment between regional lexical variation in Twitter corpora and traditional survey data [29]. The geographical patterns revealed by the data, therefore, provide curious insight into the analysis of lexical innovation in Italian.</p>
        <sec id="sec-5-1-1">
          <title>5.2. Yields Comparison</title>
          <p>To evaluate our ℬ strategy, we compare subset ℬ’s yield with 𝒜<sup>+</sup>, which is defined as the partition of 𝒜 with ρ &gt; 0.2, in order to fairly represent the approach of previous studies [18, 7]. Table 4 shows the results.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <p>The adjusted yield, computed excluding attested forms and hashtags, favours 𝒜<sup>+</sup>. However, the projected yield, computed including hashtags and assuming the previous yield on them, favours ℬ.</p>
        <p>Even without hashtags, ℬ is noteworthy: its intersection with 𝒜<sup>+</sup> yields less than the other two, indicating non-redundancy and hence the success of ℬ in isolating behaviours excluded by 𝒜<sup>+</sup>.</p>
        <p>Despite requiring five thresholds, ℬ’s are intuitively meaningful, unlike Spearman’s more abstract ρ. Additionally, ρ is computationally expensive, making our approach more suitable for data exploration on weaker machines or larger datasets.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Limitations</title>
        <p>Although the one-year time frame considered is both effective in the context of Twitter, where linguistic phenomena appear and spread in a short span of time, and coherent with our objective to investigate the initial emergence of new words, it could well fail to detect new forms that spread more slowly albeit at a constant rate.</p>
        <p>Annotation with AntConc revealed the sporadic presence of tweets in French and Spanish. These had no impact on the identified forms, but did affect the selection of the subsets. However, we expect this impact to be negligible and refrain from quantifying the effect at this time. Conversely, the lang:it filter most likely excluded some tweets in Italian, but no further assessment is possible with our dataset; there is also no public information about Twitter’s proprietary language identification algorithm.</p>
        <p>Some instances of local Italian varieties were also noticed, confirming previous work [30], but they had no bearing on our analysis as we discarded regionalisms.</p>
      </sec>
      <sec id="sec-5-6">
        <p>A full-fledged time/space analysis is beyond the scope of this work, but we estimate our approach to be upwards of 50 times faster. More details are provided in Appendix B.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <sec id="sec-6-1">
        <title>References</title>
        <p>[19] […]ing Machinery, New York, NY, USA, 2016, pp. 553–562. URL: https://doi.org/10.1145/2835776.2835784. doi:10.1145/2835776.2835784.</p>
        <p>[20] M. Laitinen, M. Fatemi, J. Lundberg, Size Matters: Digital Social Networks and Language Change, Frontiers in Artificial Intelligence 3 (2020). URL: https://www.frontiersin.org/article/10.3389/frai.2020.00046/full. doi:10.3389/frai.2020.00046.</p>
        <p>[21] S. Spina, Fiumi di parole. Discorso e grammatica delle conversazioni scritte in Twitter, Aracne, 2019.</p>
        <p>[22] D. Nguyen, A. Seza Doğruöz, C. P. Rosé, F. De Jong, Computational Sociolinguistics: A Survey, Computational Linguistics 42 (2016) 537–593. URL: https://direct.mit.edu/coli/article/42/3/537-593/1536. doi:10.1162/COLI_a_00258.</p>
        <p>[23] D. Hovy, A. Rahimi, T. Baldwin, J. Brooke, Visualizing Regional Language Variation Across Europe on Twitter, in: S. D. Brunn, R. Kehrein (Eds.), Handbook of the Changing World Language Map, Springer International Publishing, Cham, 2019, pp. 3719–3742. URL: http://link.springer.com/10.1007/978-3-030-02438-3_175. doi:10.1007/978-3-030-02438-3_175.</p>
        <p>[24] P. Brasolin, Breviloquia italica: data pipeline, 2023. URL: https://doi.org/10.5281/zenodo.10010427. doi:10.5281/zenodo.10010427.</p>
        <p>[25] L. Anthony, AntConc (Version 4.2.0) [Computer Software], https://www.laurenceanthony.net/software, 2022. Tokyo, Japan: Waseda University.</p>
        <p>[26] The Unicode Consortium, The Unicode Standard, Technical Report Version 15.0.0, Unicode Consortium, Mountain View, CA, 2022. URL: https://www.unicode.org/versions/Unicode15.0.0/.</p>
        <p>[27] S. Spina, Il Perugia Corpus: una risorsa di riferimento per l’italiano. Composizione, annotazione e valutazione, in: R. Basili, A. Lenci, B. Magnini (Eds.), Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014, volume 1, Pisa University Press, Pisa, 2014, pp. 354–359.</p>
        <p>[28] G. H. Franzini, S. Spina, P. Brasolin, Breviloquia italica: annotations, 2023. URL: https://doi.org/10.5281/zenodo.10010528. doi:10.5281/zenodo.10010528.</p>
        <p>[29] J. Grieve, C. Montgomery, A. Nini, A. Murakami, D. Guo, Mapping Lexical Dialect Variation in British English using Twitter, Frontiers in Artificial Intelligence 2 (2019). URL: https://www.frontiersin.org/articles/10.3389/frai.2019.00011/full. doi:10.3389/frai.2019.00011.</p>
        <p>[30] A. Ramponi, C. Casula, DiatopIt: A corpus of social media posts for the study of diatopic language variation in Italy, in: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 187–199. URL: https://aclanthology.org/2023.vardial-1.19. doi:10.18653/v1/2023.vardial-1.19.</p>
        <p>[31] S. Spina, P. Brasolin, G. H. Franzini, Mapping emerging vocabulary in a large corpus of Italian tweets, Research in Corpus Linguistics (in preparation).</p>
        <p>[32] N. Zingarelli, lo Zingarelli 2022, I grandi dizionari, 2022.</p>
      </sec>
      <sec id="sec-6-2">
        <title>A. Data Pipeline Software Stack</title>
        <p>The broad strokes of how we used Open Source Software
to build our data pipeline are as follows: jq for bulk
JSONL data manipulation parallelised with GNU Parallel;
NumPy, SciPy and Pandas for general data manipulation
and analysis parallelised with Modin; JupyterLab
for data exploration; topojson, Shapely and GeoPandas
for geographical data manipulation; emoji and spaCy
for textual data cleanup and tokenisation; Matplotlib
and seaborn for visualisation. All logic and glue code
is written using Python and GNU Bash. GNU Make is
used to codify an executable dependency tree between
the pipeline stages, inputs and outputs. The versions of
all stand-alone software and Python packages we used
are listed in Table 5. Indirect Python dependencies are
listed in the requirements.txt file of [24].</p>
        <p>Table 5: Software and Python packages used in our data pipeline.
jq 1.6 jqlang.github.io/jq
GNU Parallel 20230622 gnu.org/software/parallel
GNU Bash 5.1.16 gnu.org/software/bash
GNU Make 4.3 gnu.org/software/make
Python 3.10.8 python.org
NumPy 1.25.2 numpy.org
SciPy 1.11.1 scipy.org
Pandas 2.0.3 pandas.pydata.org
Modin 0.23.0 modin.readthedocs.io
JupyterLab 4.0.4 jupyterlab.readthedocs.io
topojson 1.5 mattijn.github.io/topojson
Shapely 2.0.1 shapely.readthedocs.io
GeoPandas 0.13.2 geopandas.org
spaCy 3.6.1 spacy.io
emoji 2.7.0 github.com/carpedm20/emoji
Matplotlib 3.7.2 matplotlib.org
seaborn 0.12.2 seaborn.pydata.org</p>
        <p>B. Computational Complexity</p>
        <p>A full-fledged time/space complexity analysis is beyond
the scope of this work, as it would require delving into
the implementation details of NumPy, SciPy, Pandas
and Modin. However, we can still provide some general
considerations and empirical measures on the behaviour
of the two proposed methods on a dataset with c columns
and r rows. In our case, c = 365 (days of the year) and
r ≃ 926 (token types).</p>
        <p>Calculating Spearman’s ρ for a row involves ranking
two time series and calculating their Pearson
correlation coefficient, so it is safe to assume its best-case
run-time is linear in c (and probably log-linear on average,
depending on implementation details). Applying our
method to a row involves (cumulative) sums and finding
minima/maxima, so its worst-case run-time is linear in
c. Naïve implementations using either method would
simply iterate over the rows of the dataset, so they have
linear run-time in r.</p>
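        <p>In asymptotic terms, the reasoning above can be summarised as follows
(a rough summary under the stated assumptions, with our own symbols T for
per-dataset run-time, not a formal analysis):</p>
        <preformat>
```latex
T_{\mathrm{Spearman}}(r, c) = O(r \cdot c \log c) \quad \text{(sort-based ranking; } \Omega(r \cdot c) \text{ in the best case)}
T_{\mathrm{ours}}(r, c) = \Theta(r \cdot c)
```
        </preformat>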
        <p>Given this rough time complexity analysis, we can
expect our method to have some advantage regardless of
implementation details. To quantify it, we ran a benchmark
abstracting the core computations of the two methods
and comparing their run-times for c = 365 and values
of r up to the scale of our dataset. The code is presented
in Figure 3 and the results are charted in Figure 4: we
observe that our method is more than 50 times faster on
bigger datasets.</p>
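        <p>As a concrete illustration, a minimal sketch of such a benchmark is
given below. It is not the actual code of Figure 3: the function names and
the synthetic data are ours, and it only abstracts the two per-row
computations described above (rank correlation against the day index versus
cumulative sums with extrema); measured ratios will vary by machine.</p>
        <preformat>
```python
import time

import numpy as np
from scipy.stats import spearmanr


def spearman_core(data, t):
    # Baseline: rank-correlate each row of `data` with the day index `t`.
    # Ranking sorts each series, hence roughly O(c log c) per row.
    return np.array([spearmanr(row, t)[0] for row in data])


def cumsum_core(data):
    # Alternative: cumulative sums plus per-row extrema, O(c) per row.
    cs = data.cumsum(axis=1)
    return cs.max(axis=1) - cs.min(axis=1)


if __name__ == "__main__":
    c = 365                       # days of the year
    t = np.arange(c)
    rng = np.random.default_rng(42)
    for r in (100, 1000):         # rows; scale up towards the dataset size
        data = rng.random((r, c))
        t0 = time.perf_counter()
        spearman_core(data, t)
        t1 = time.perf_counter()
        cumsum_core(data)
        t2 = time.perf_counter()
        print(f"r={r}: run-time ratio {(t1 - t0) / (t2 - t1):.1f}")
```
        </preformat>
        <p>Both cores iterate row-wise, so the measured ratio isolates the
per-row cost difference; parallelisation (e.g. with Modin) scales both sides
similarly.</p>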
        <p>The benchmark was run on a single core and expressed
only as a speedup ratio to give a sense of what to
generally expect. The implementation in Brasolin [24] is
parallelised using Modin because we could run it on a
hefty Intel Xeon E5-2690 v4 CPU with 128 GB RAM:
we traded heavy memory usage for a further speedup,
essentially making data exploration in a Jupyter notebook
not only viable but pleasant. As a result, performing a
detailed space complexity analysis is a particularly delicate
matter and one that we do not address here. However,
we should stress that our alternative method was
initially developed because our means at the outset were
much more limited (memory in particular proved to be
a bottleneck at 16 GB), and that the initial, sequential,
memory-aware implementation is still present in a
comment alongside the parallelised one for use on smaller
machines.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>C. Full List of Innovative Forms</title>
      <p>[Figure 4 residue: log-scaled x-axis “Dataset rows (r)”, 10² to 10⁵.]</p>
      <sec id="sec-7-1">
        <title>D. Comparison Charts for 𝒜 and ℬ</title>
        <p>See Figure 6.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>E. Choropleth Maps of Examples</title>
      <sec id="sec-8-1">
        <title>See Figures 7 and 8.</title>
        <p>[Figure residue; recoverable caption fragment: occurrences per
million tokens at the regional level. Total occurrences are provided with the
titles, foreign ones in the legends. We omit f4scist4 as it occurs outside of
Italy only. Panel labels: Full dataset / Subset; filters: U &gt; 9, O &gt; 9;
A &gt; 7, Z &lt; 354, Z − A &gt; 28; O &gt; 0.2. Not shown: 4 foreign; 7 foreign.]</p>
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>