<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>“Ti blocco perché sei un trollazzo”. Lexical Innovation in Contemporary Italian in a Large Twitter Corpus</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Brasolin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Greta H. Franzini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefania Spina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eurac Research (Institute for Applied Linguistics)</institution>
          ,
          <addr-line>Viale Druso 1, 39100 Bolzano BZ</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University for Foreigners of Perugia</institution>
          ,
          <addr-line>Piazza Fortebraccio 4, 06123 Perugia PG</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>93</fpage>
      <lpage>105</lpage>
      <abstract>
        <p>This study investigates emerging vocabulary in contemporary Italian in a corpus of 5.32 M timestamped and geotagged tweets extracted from the Italian timeline throughout 2022. We automatically identify and manually distill 8 133 candidate neologisms down to 346 unattested word forms, shedding light on their spatio-temporal circulation patterns.</p>
      </abstract>
      <kwd-group>
        <kwd>twitter</kwd>
        <kwd>social media</kwd>
        <kwd>corpora</kwd>
        <kwd>italian</kwd>
        <kwd>lexical innovation</kwd>
        <kwd>language change</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Lexical innovation is one of the driving mechanisms of language change [1, 2]: through the creation of new words (in this paper, “word” and “form” are used interchangeably) and their integration into existing lexical systems [3], languages evolve and adapt to new social and technological contexts, which are constantly and rapidly changing. The process of creating new words can be approached from different standpoints. Firstly, the choice of sources necessary to trace the process of lexical innovation has great methodological relevance. One of the main traditional sources has been newspaper texts, which have the double benefit of being easily available and quantitatively relevant [4]. Secondly, lexical innovation follows different steps and usually develops from the initial emergence of new words in specific contexts to their proliferation to wider contexts and domains. This process may end with the institutionalisation of new word forms [5, 6] through their inclusion in dictionaries and consolidation in standard use. Thirdly, the linguistic processes leading to the creation of new words can be different and can include phenomena of derivation, composition, transcategorisation, creation of portmanteau forms, semantic shifts, and borrowing from other languages.</p>
      <p>The aim of this study is twofold. On the one hand, we present an analysis of emerging vocabulary in contemporary Italian stemming from Twitter interactions, using the 2022 Italian timeline as a source; social media represents an opportunity to analyse new word forms surfacing in everyday conversation, and provides vast amounts of data produced in real time by a large, heterogeneous and representative sample of speakers. Furthermore, the availability of geotagged texts enables the investigation of possible patterns of lexical innovation related to specific geographical areas [7]. This possibility is particularly promising in languages, like Italian, characterised by deep and articulated geographical variation. On the other hand, we propose a novel methodology to process and filter word forms acquired from a sizeable Twitter corpus, with the aim of detecting those that represent the best candidates to become new words.</p>
      <p>The result of the study is a list of 346 word forms, classified into 15 categories based on the linguistic process of lexical creation and yet unattested in two of the most up-to-date Italian lexicographic resources.</p>
      <sec id="sec-1-2">
        <title>2. Related Work</title>
        <p>Studies on lexical innovation in Italian have a long tradition [8], and have produced extensive lexicographic works dedicated to neologisms (e.g., [9], to mention one of the most recent), as well as a vast body of research (e.g., [10], [11] and [12]). One of the most widely discussed topics is the classification of the linguistic processes leading to the creation and spread of new words.</p>
        <p>Traditionally, it is acknowledged that the means by which languages enrich their vocabulary are essentially four: the acquisition of new elements from other languages, the formation of new words from pre-existing lexical elements, the change of grammatical category, and the shift in the meaning of words already in use [13]. In the last few decades, the Osservatorio neologico della lingua italiana (ONLI, https://www.iliesi.cnr.it/ONLI/intro.php) [4] has been tracking new words emerging in Italian newspapers, producing a database which, to date, includes 2 986 forms with definition, date of attestation and first retrieved occurrence in the press.</p>
        <p>CLiC-it 2023: 9th Italian Conference on Computational Linguistics, November 30 to December 02, 2023, Venice, Italy. * Corresponding author: paolo.brasolin@eurac.edu (P. Brasolin); greta.franzini@eurac.edu (G. H. Franzini); stefania.spina@unistrapg.it (S. Spina). https://paolobrasolin.github.io/ (P. Brasolin). ORCID: 0000-0003-2471-7797 (P. Brasolin), 0000-0003-1159-5575 (G. H. Franzini), 0000-0002-9957-3903 (S. Spina). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      </sec>
      <sec id="sec-1-3">
        <p>More recently, several studies have highlighted the
benefits of using social media to track new word forms
cropping up in informal contexts, such as everyday
conversation, as opposed to newspaper texts, which are more
formal and draw from different registers [14, 15, 16].
Additionally, as a populous repository of conversations held
in real time by a large number of speakers, social media
can capture lexical creativity originating in communities
of people rather than inventive journalism [17]. This
use of social media has produced a number of studies
[18, 7, 19] focussed on the initial and less documented
phase of the lexical innovation process, right after the
words’ creation and first use, and well before their final
institutionalisation and inclusion in dictionaries [5, 6].</p>
        <p>It is well-known that only a small portion of the words
coined in everyday language use become new entries in
dictionaries and thus part of the vocabulary: many
remain ephemeral but are nevertheless compelling, as they
provide evidence of the linguistic mechanisms driving
the lexical innovation process. Generally, social media
allow researchers to extract and use an unprecedented
amount of conversational data [20, 21], which can
provide reliable computations of lexical innovation and thus
give a significant boost to the study of language variation
and change [22, 23].</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Corpus</title>
      <p>In order to investigate emerging vocabulary in contemporary Italian, we used a corpus of timestamped and geotagged tweets extracted from the Italian Twitter timeline throughout 2022. The corpus comprises 5.32 M tweets written by 153 k unique users, amounting to 71.5 M tokens (or 564 M characters).</p>
      <p>To the best of our knowledge, this is the first and largest study yet to address lexical innovation in Italian Twitter.</p>
      <p>Regrettably, this could also be the last. The recent takeover of Twitter collapsed its value for academia: as of summer 2023, publicly accessible data has been severely restricted, API prices have sharply risen, and academic access has been cancelled outright.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <p>Manual annotation aside, all our procedures are implemented as code and organised into a series of modular stages. To facilitate operation, they are accompanied and coordinated by an executable dependency tree specifying the relations between them, their inputs and their outputs. Together, they constitute a cohesive and reproducible data pipeline.</p>
      <p>We exclusively used Open Source Software, mostly in the form of well-known Python packages and GNU tools (https://www.gnu.org/). An exhaustive list including version numbers can be found in Appendix A.</p>
      <p>In the following, we only discuss the general implementation design. The full source code is documented and available in [24].</p>
      <sec id="sec-3-1">
        <title>4.1. Acquisition</title>
        <p>Our corpus samples the Italian Twitter timeline of 2022. We define this notion as the conjunction of the conditions listed in Table 1, expressed using Twitter’s advanced search query language (extensive unofficial documentation is available at https://github.com/igorbrigadir/twitter-advanced-search/; the user interface is found at https://www.twitter.com/search-advanced). Thus, our corpus is a subset of the results given by the search combining the aforementioned conditions at the time of sampling.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Preparation</title>
        <sec id="sec-3-2-1">
          <title>4.2.1. Geographic Data</title>
          <p>Tweets can bear geolocation data in two independent forms: a latitude/longitude pair and an association with a place. A place is an administrative division or a point of interest, and it is characterised by an id, a country code, a geographical bounding box and other metadata. In our corpus, 99.43 % of tweets bear a place, 0.04 % only bear a lat./long. pair, and 0.53 % bear neither (this is possible because Twitter data can be redacted). Consequently, despite lat./long. pairs being more precise, we chose to deal with places only, as they cover the vast majority of tweets and already include the country code necessary to restrict the data exactly to Italy.</p>
          <p>We extracted 34.8 k unique places, keeping their id and country code (47.0 % are IT), and computed the centroid of their bounding box as a reference point for geographical calculations. 91.77 % of tweets refer to places with the IT country code; we assigned these to Italian regions by matching their centroids with government data on administrative boundaries (official ISTAT data is archived at https://www.istat.it/it/archivio/222527; we used the community-maintained GeoJSON version available at https://github.com/openpolis/geojson-italy/tree/2023.1) in order to plot choropleth maps of Italy. Of the remaining tweets, 8.16 % refer to places with other country codes and 0.07 % refer to a generic place representing the entirety of Italy: the occurrences of candidate forms from these two categories are included in the choropleth maps under a legend titled “Not shown”.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>4.2.2. Textual Data</title>
          <p>Tweets are rich structures. They include an id, a user id, a timestamp, the full text, the geolocation data discussed above, a list of entities and other metadata. An entity is a character range in the full text labelled by a type (either url, user mention, hashtag, symbol or media) and other metadata.</p>
          <p>First, we extracted all full texts into a flat data file to be loaded into AntConc [25] as an aid to the downstream manual annotation process.</p>
          <p>Then, realising the entity metadata could greatly support the tokeniser at a later stage, we inlined them into the full text as delimiter markers, picking a different pair for every entity type from a set of reserved Unicode code points (we picked from the Private Use Area in the Basic Multilingual Plane, a set of code points left undefined by The Unicode Consortium [26, chapter 23.5] and reserved for special custom usage). Figure 1 illustrates how the procedure is carried out for hashtag entities: the range of the hashtag entity in “Hi #twitter !” is wrapped with the reserved delimiters U+E000 and U+E001.</p>
          <p>Finally, we extracted 5.32 M tweets, keeping their id, user id, timestamp, full text with inlined entities, and place id.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Cleanup and Tokenisation</title>
        <p>We used the spaCy v3.6.1 Italian tokeniser. However, tweets are challenging for a stock tokeniser and some issues need to be addressed.</p>
        <p>The first problem is the extensive use of Unicode (especially emojis), along with liberal usage of casing and whitespace. This can be easily addressed: we replaced all emojis with spaces, lowercased the whole text, and replaced any streaks of whitespace with a single space.</p>
        <p>The second, trickier, problem is the liberal usage of punctuation marks. Solving this required extending the tokeniser’s default infix matcher to also match any sequence of these commonly abused punctuation marks: ?!;:,."()[]{}</p>
        <p>The third and last problem is the presence of entities (urls, hashtags, etc.). This is where our previously inlined entity annotations came into play, quickly enabling us to make the tokeniser aware of them as follows:
• wrap all delimited regions in the text with spaces to nudge the tokeniser into correctly detecting their beginning,
• define a custom token matcher detecting any sequence whose extrema are our delimiter character pairs, and
• disable the tokeniser’s default url matcher to avoid conflicts with our custom matcher.</p>
        <p>The stratagems above allowed us to execute the tokeniser producing a negligible amount of spurious tokens. We then filtered its output, discarding tokens that were pure whitespace, pure punctuation, pure numbers, broken and/or non-existent handles (i.e., tokens beginning with @ but not marked as entities), and all entities except hashtags. Processing all tweets as described, we extracted 71.5 M tokens, with 926 k types.</p>
      </sec>
      <sec id="sec-3-4">
        <title>4.4. Candidate Selection</title>
        <p>To select the candidates for annotation we applied two separate strategies, producing two subsets 𝒜 and ℬ with a slight overlap, as detailed in Table 2. 𝒜 derives from an established method in literature, and ℬ from our attempt to reach for a more interpretable and computationally lighter alternative. We now describe them both in detail.</p>
        <sec id="sec-3-4-1">
          <title>4.4.1. Subset 𝒜: Spearman’s ρ</title>
          <p>The first strategy follows in the steps of previous studies [18, 7] and amounts to calculating a measure of how monotonically the usage of a token increases in time, in order to reject tokens below a fixed threshold. The chosen measure of monotonicity is the Spearman rank correlation coefficient between the daily occurrences of a token (normalised by daily total token count) and the day number; we denote it with ρ. The choice of threshold is arbitrary: while the cited studies operated on multi-billion tweet corpora picking very restrictive thresholds at 0.7 and 0.8, our corpus is much smaller, so we can afford to lower the threshold until the size of the produced subset is still comfortable to annotate. We picked ρ &gt; 0.2, selecting a subset of 4 090 candidates.</p>
          <p>However, setting a positive lower bound to ρ penalises usage patterns we consider plausible for an emerging form (e.g., a sharp rise before midyear followed by a slow descent to a stable non-zero plateau). Therefore, we chose to extend the criteria to |ρ| &gt; 0.2, selecting 2 336 additional candidates. In other words, we are discarding the central values of ρ, where it is less predictive. Furthermore, we decided to perform the same calculation on the daily unique users of a token; we denote the result with ρ<sub>u</sub>. We allowed tokens with |ρ<sub>u</sub>| &gt; 0.2, selecting 311 additional candidates.</p>
          <p>Our decision to be so permissive, at the cost of extra annotation effort, was dictated by the intention to experimentally evaluate the effectiveness of the bounds over a wide range of threshold choices. Subset 𝒜 is thus defined by the combined condition max(|ρ|, |ρ<sub>u</sub>|) &gt; 0.2, selecting 6 737 candidates (0.73 % of the total).</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>4.5. Annotation</title>
        <p>The subset for annotation 𝒜 ∪ ℬ amounts to 26 890 candidates (2.90 % of the total extracted forms). To reduce the amount of handiwork, we used a lexicon of 514 k Italian forms specifically built for part-of-speech tagging tasks [27] to automatically tag already attested forms as uninteresting (including hashtags, to be analysed separately at a later stage), thus excluding 18 757 candidates. This left us with 8 133 candidate forms for manual annotation, which was performed in two stages by the second and third author of the present paper, trained as a classicist and a corpus linguist respectively. Firstly, we loaded the corpus into AntConc [25] to look up each form’s context (KWIC, KeyWord In Context format), while concurrently cross-checking two freely available online dictionaries (Garzanti at https://www.garzantilinguistica.it/ and Treccani at https://www.treccani.it/vocabolario/; the Slengo urban dictionary at https://slengo.it/ was also used for the occasional look-up of slang forms) and the ONLI neologisms database for attestation. As a result of this search, the annotators rated forms as either innovative or non-innovative. Inter-annotator disagreement was settled with a negotiating phase until agreement could be reached for all forms. Examples of discarded entries include forms attested in at least one of the consulted dictionaries; mistypes caused by key proximity; popular terms, e.g., bimbominchia; foreign words well attested in the media but not in dictionaries (yet), e.g., foliage, spending review, sponsorship; adapted loanwords, e.g., followo, crashare; infrequently used foreign words, e.g., smoothie, veggie, waffle; infrequently used foreign acronyms, e.g., PTSD; regionalisms and regional variants, e.g., annassero, ciolla, giargiana; gender-inclusive graphic variants, e.g., cittadin@; nicknames, e.g., pupone for footballer Francesco Totti; and the unfriendly portmanteau Cessica (cesso + Jessica).</p>
        <p>Next, and as shown in Table 3, we grouped innovative forms into one or more categories according to the ONLI typology scheme, with minor adaptations and integrations. Specifically, we only relied on categories referring to formal properties.</p>
      </sec>
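      <p>The entity inlining and tokenisation steps above can be illustrated with a self-contained sketch. Function names are illustrative, and a plain regular expression stands in for the spaCy-based pipeline; the actual implementation is in [24].</p>
      <preformat>
```python
import re

# Illustrative sketch of Sections 4.2.2 and 4.3: inline an entity's character
# range with a reserved Private Use Area delimiter pair, then tokenise while
# keeping delimited spans whole and collapsing runs of abused punctuation.
# This is a plain-regex stand-in, not the authors' spaCy-based code.
DELIMS = {"hashtag": ("\ue000", "\ue001")}  # one reserved pair per entity type

def inline_entities(text, entities):
    """entities: list of (start, end, type) character ranges, as in Twitter metadata."""
    # Work backwards so earlier offsets stay valid after each insertion.
    for start, end, etype in sorted(entities, reverse=True):
        opener, closer = DELIMS[etype]
        text = text[:start] + opener + text[start:end] + closer + text[end:]
    return text

# Delimited entity spans are matched first and therefore kept whole;
# runs of commonly abused punctuation become single tokens.
TOKEN = re.compile(r"[\ue000][^\ue001]*[\ue001]|\w+|[?!;:,.\"()\[\]{}]+")

def tokenise(text):
    return TOKEN.findall(text.lower())

print(tokenise(inline_entities("Hi #twitter !!!", [(3, 11, "hashtag")])))
```
      </preformat>
      <p>A real implementation would instead extend spaCy’s infix patterns and token matcher and disable its default url matcher, as described above.</p>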
    </sec>
    <sec id="sec-4">
      <title>4.4.2. Subset ℬ: An Alternative Approach</title>
      <p>ρ quantifies how much a form’s usage increases monotonically during the year. As previously mentioned, while this complex measure correlates with the behaviour of some emerging forms, it also excludes plausible usage patterns.</p>
      <p>We take the complementary approach and try instead to formulate simple criteria to exclude usage patterns that we would not expect from emerging forms:
• to reject accidental and sporadic phenomena (e.g., typos, inside jokes, etc.), we set a lower bound to the count of unique users u and occurrences n;
• to reject forms already in use from the past, we set a lower bound to the day of first occurrence A;
• to reject forms disappearing early, we set a large lower bound to the day of last occurrence Z;
• to reject ephemeral forms, we set a lower bound to the length of the usage lapse Z − A.</p>
      <p>We chose the following thresholds: u &gt; 9, n &gt; 9, A &gt; 7, Z &gt; 351 and Z − A &gt; 28. They read out as: we want forms that are used at least ten times by at least ten people, appear from the second week of January, do not disappear before mid December, and last more than four weeks. The specific values were tuned to cut off the markedly heavier tails from the distributions of the respective variables. This furthers the intention underlying our criteria to exclude the most common behaviours expected from non-emerging forms.</p>
      <p>Appendix D contains charts showing how 𝒜 and ℬ partition the dataset and comparing the effect of their defining criteria over the parameter space. Subset ℬ, defined by the conditions above, includes 21 132 candidates (2.28 % of the total).</p>
      <table-wrap id="tbl3">
        <label>Table 3</label>
        <caption>
          <p>Categories of the 346 innovative forms, with form counts and examples.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Category</th><th>Form count</th><th>Examples</th></tr>
          </thead>
          <tbody>
            <tr><td>orthographic variation</td><td>109</td><td>minkiate, rix, scienzah</td></tr>
            <tr><td>univerbation</td><td>48</td><td>lho, miraccomando</td></tr>
            <tr><td>suffixation</td><td>45</td><td>cinesata, sfanculamento</td></tr>
            <tr><td>loanword</td><td>40</td><td>fancam, scammer</td></tr>
            <tr><td>portmanteau</td><td>33</td><td>gintoxic, nazipass</td></tr>
            <tr><td>loanword adaptation</td><td>24</td><td>flexo, droppare</td></tr>
            <tr><td>alteration</td><td>17</td><td>fattoni</td></tr>
            <tr><td>prefixation</td><td>8</td><td>bidosati, pregirata</td></tr>
            <tr><td>acronym</td><td>6</td><td>lmv, sgp</td></tr>
            <tr><td>transcategorisation</td><td>6</td><td>cuora</td></tr>
            <tr><td>compounding</td><td>3</td><td>contapalle</td></tr>
            <tr><td>deonymic derivation</td><td>3</td><td>drum</td></tr>
            <tr><td>redefinition</td><td>2</td><td>maranza</td></tr>
            <tr><td>acronymic derivation</td><td>1</td><td>efeci</td></tr>
            <tr><td>tmesis</td><td>1</td><td>facenza</td></tr>
            <tr><td>Total form count</td><td>346</td><td/></tr>
          </tbody>
        </table>
      </table-wrap>
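      <p>The five thresholds above can be read as a single predicate over per-form statistics; a minimal sketch with an illustrative function name, not the pipeline code:</p>
      <preformat>
```python
# Toy sketch of the subset-B criteria from Section 4.4.2: keep a form when
# u and n exceed 9, first occurrence A falls after day 7, last occurrence Z
# falls after day 351, and the usage lapse Z - A exceeds 28 days.
from operator import gt  # gt(a, b) is the comparison a greater than b

def passes_subset_b(u, n, first_day, last_day):
    """u: unique users; n: occurrences; days are 1-based day numbers of 2022."""
    return (gt(u, 9) and gt(n, 9)
            and gt(first_day, 7)               # appears from the second week of January
            and gt(last_day, 351)              # survives until mid December
            and gt(last_day - first_day, 28))  # lasts more than four weeks

print(passes_subset_b(u=12, n=40, first_day=15, last_day=360))  # plausible emerging form
print(passes_subset_b(u=3, n=5, first_day=100, last_day=120))   # sporadic phenomenon
```
      </preformat>
      <p>Each bound is directly interpretable, which is the point of the approach: no correlation statistic has to be computed over the whole time series.</p>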
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <sec id="sec-5-0">
        <title>5.1. Emerging Forms</title>
        <p>We ignored the expressive emphasis category used in the ONLI: emphasis is very common in Twitter interactions [21] and falls under all other categories. In addition, we merged multiple ONLI categories into one: e.g., suffissazione, suffissoide, deverbale and denominale were merged into suffixation, while prefissazione and prefissoide were merged into prefixation. Finally, a new tmesis category was added to account for forms deriving from the splitting of compounds (e.g., facenza from nullafacenza). Appendix C provides the complete list, and a machine-readable dataset of annotated candidates is available in Franzini et al. [28].</p>
        <p>Overall, the 346 forms give insights into the most common means by which potential new words are created by Italian speakers. Some of these are those traditionally detected in neologism studies: the -ata (poverata), -ismo (cialtronismo) and -mento (sfanculamento) suffixes, for example, are among the most common morphological resources used to derive new words from existing ones [12].</p>
        <p>However, other forms seem particularly productive as potential sources of lexical innovation. Adapted loanwords, for instance, draw on the broad mechanism of inclusion of foreign verbs in the first conjugation in -are (droppare, followo, switchare), but also on less common phenomena, such as alteration through the suffixes -ino (trollini) or -azzo (trollazzo). Moreover, the widespread attitude towards evaluative language in social media interactions is witnessed by the presence of several emphatic and intensifying forms relying on different expressive means: in addition to the superlative suffix -issimo/a applied to verbs (adorissimo, riderissimo) or even employed as an autonomous word, particularly noteworthy is the use of augmentative suffixes like -one (personaggione, garone), univerbated forms (opperbacco, eddaiii, masticazzi), or portmanteaus such as nazipass and sinistronzi where emphasis blends with wordplay. Indeed, ironic and catchy wordplay frequently leads to lexical innovation and is typical of social media conversations.</p>
        <p>The most productive categories of lexical innovation in our corpus are:
• orthographic variation, often used either for emphasis (e.g., minkiate), to shorten existing words (e.g., rix for risposta), to conceal online conversation (also known as “leetspeak”, e.g., f4scist4), for fun (e.g., gomblotto) or for sarcasm (e.g., scienzah, with a final -h expressing scepticism towards scientific advances);
• univerbation, with forms such as miraccomando, lho or senzapalle;
• suffixation, featuring many forms ending in -ato/a (e.g., cinesata, quarantenato), -mento (e.g., sfanculamento) or with the intensifying -issimo/a applied to verbs (e.g., riderissimo) and to inherently intensified adjectives (e.g., incantevolissimissima from incantevole);
• (adapted) loanword, chiefly borrowed from English, with forms like flexo, loser and trollazzo;
• portmanteau, mostly relating to politics, with words such as cessodestra, sinistronzi and the amusing lettamaio (the combination of politicians Enrico Letta’s and Luigi Di Maio’s surnames reading as “pigsty”), but also gintoxic and maxipass.</p>
        <p>Overall, a non-negligible part of the detected innovative forms are tied to the online sphere, and, in specific cases, are not expected to be used in different contexts or to establish themselves as new Italian words (e.g., f4scista or mer*a, which are mainly used to conceal content). Nevertheless, their emerging use in Twitter interactions evidences the linguistic mechanisms underlying lexical innovation in Italian. For each form we produce a choropleth map showing its usage. Appendix E presents the maps of all emerging forms mentioned in the article, while Figure 2 illustrates four notable examples from different categories. The map of gomblotto shows that orthographic variation, when used for emphasis or ludic purposes, is widespread in almost all regions, though predominantly in Lombardy. Conversely, when orthographic variation is not primarily intended as a joke (e.g., poki or qndo), the spread of new forms is not as far-reaching. Similar considerations can be made for univerbated forms.</p>
      </sec>
      <sec id="sec-5-1">
        <p>Figure 2: choropleth maps of regional i.p.m. for four notable emerging forms: gomblotto, miraccomando, flexo and fattoni.</p>
        <p>Univerbated forms appear to be evenly, albeit thinly, spread out, with the occasional regional peak: miraccomando, for instance, is popular in Lombardy but less so in other regions. Other words reveal different patterns: the loanword flexo, for instance, meaning “to flaunt”, is mostly used in the western part of the country with little to no attestation in the lower eastern regions; fattoni, an alteration of “fatto” to denote unreliable individuals and junkies, appears to be in use in the northern regions of Lombardy and Veneto but not so in either the eastern part of the country or the islands. Although, intuitively, spatial variation in social media has different characteristics from traditional geographical variation in relation to language use, previous research has detected a broad alignment between regional lexical variation in Twitter corpora and traditional survey data [29]. The geographical patterns revealed by the data, therefore, provide curious insight into the analysis of lexical innovation in Italian.</p>
        <sec id="sec-5-1-1">
          <title>5.2. Yields Comparison</title>
          <p>To evaluate our ℬ strategy, we compare subset ℬ’s yield with 𝒜<sup>+</sup>, which is defined as the partition of 𝒜 with ρ &gt; 0.2, in order to fairly represent the approach of previous studies [18, 7]. Table 4 shows the results.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <p>The adjusted yield, computed excluding attested forms and hashtags, favours 𝒜<sup>+</sup>. However, the projected yield, computed including hashtags and assuming the previous yield on them, favours ℬ.</p>
        <p>Even without hashtags, ℬ is noteworthy: its intersection with 𝒜<sup>+</sup> yields less than the other two, indicating non-redundancy and hence the success of ℬ in isolating behaviours excluded by 𝒜<sup>+</sup>.</p>
        <p>Despite requiring five thresholds, ℬ’s are intuitively meaningful, unlike Spearman’s more abstract ρ. Additionally, ρ is computationally expensive, making our approach more suitable for data exploration on weaker machines or larger datasets.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Limitations</title>
        <p>Although the one-year time frame considered is both effective in the context of Twitter, where linguistic phenomena appear and spread in a short span of time, and coherent with our objective to investigate the initial emergence of new words, it could well fail to detect new forms that spread more slowly albeit at a constant rate.</p>
        <p>Annotation with AntConc revealed the sporadic presence of tweets in French and Spanish. These had no impact on the identified forms, but did affect the selection of the subsets. However, we expect this impact to be negligible and refrain from quantifying the effect at this time. Conversely, the lang:it filter most likely excluded some tweets in Italian, but no further assessment is possible with our dataset; there is also no public information about Twitter’s proprietary language identification algorithm.</p>
        <p>Some instances of local Italian varieties were also noticed, confirming previous work [30], but they had no bearing on our analysis as we discarded regionalisms.</p>
      </sec>
      <sec id="sec-5-6">
        <p>A full-fledged time/space analysis is beyond the scope of this work, but we estimate our approach to be upwards of 50 times faster. More details are provided in Appendix B.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <sec id="sec-6-1">
        <title>References</title>
        <p>[19] […]ing Machinery, New York, NY, USA, 2016, pp. 553–562. URL: https://doi.org/10.1145/2835776.2835784. doi:10.1145/2835776.2835784.</p>
        <p>[20] M. Laitinen, M. Fatemi, J. Lundberg, Size Matters: Digital Social Networks and Language Change, Frontiers in Artificial Intelligence 3 (2020). URL: https://www.frontiersin.org/article/10.3389/frai.2020.00046/full. doi:10.3389/frai.2020.00046.</p>
        <p>[21] S. Spina, Fiumi di parole. Discorso e grammatica delle conversazioni scritte in Twitter, Aracne, 2019.</p>
        <p>[22] D. Nguyen, A. Seza Doğruöz, C. P. Rosé, F. De Jong, Computational Sociolinguistics: A Survey, Computational Linguistics 42 (2016) 537–593. URL: https://direct.mit.edu/coli/article/42/3/537-593/1536. doi:10.1162/COLI_a_00258.</p>
        <p>[23] D. Hovy, A. Rahimi, T. Baldwin, J. Brooke, Visualizing Regional Language Variation Across Europe on Twitter, in: S. D. Brunn, R. Kehrein (Eds.), Handbook of the Changing World Language Map, Springer International Publishing, Cham, 2019, pp. 3719–3742. URL: http://link.springer.com/10.1007/978-3-030-02438-3_175. doi:10.1007/978-3-030-02438-3_175.</p>
        <p>[24] P. Brasolin, Breviloquia italica: data pipeline, 2023. URL: https://doi.org/10.5281/zenodo.10010427. doi:10.5281/zenodo.10010427.</p>
        <p>[25] L. Anthony, AntConc (Version 4.2.0) [Computer Software], https://www.laurenceanthony.net/software, 2022. Tokyo, Japan: Waseda University.</p>
        <p>[26] The Unicode Consortium, The Unicode Standard, Technical Report Version 15.0.0, Unicode Consortium, Mountain View, CA, 2022. URL: https://www.unicode.org/versions/Unicode15.0.0/.</p>
        <p>[27] S. Spina, Il Perugia Corpus: una risorsa di riferimento per l’italiano. Composizione, annotazione e valutazione, in: R. Basili, A. Lenci, B. Magnini (Eds.), Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014, volume 1, Pisa University Press, Pisa, 2014, pp. 354–359.</p>
        <p>[28] G. H. Franzini, S. Spina, P. Brasolin, Breviloquia italica: annotations, 2023. URL: https://doi.org/10.5281/zenodo.10010528. doi:10.5281/zenodo.10010528.</p>
        <p>[29] J. Grieve, C. Montgomery, A. Nini, A. Murakami, D. Guo, Mapping Lexical Dialect Variation in British English using Twitter, Frontiers in Artificial Intelligence 2 (2019). URL: https://www.frontiersin.org/articles/10.3389/frai.2019.00011/full. doi:10.3389/frai.2019.00011.</p>
        <p>[30] A. Ramponi, C. Casula, DiatopIt: A corpus of social media posts for the study of diatopic language variation in Italy, in: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 187–199. URL: https://aclanthology.org/2023.vardial-1.19. doi:10.18653/v1/2023.vardial-1.19.</p>
        <p>[31] S. Spina, P. Brasolin, G. H. Franzini, Mapping emerging vocabulary in a large corpus of Italian tweets, Research in Corpus Linguistics (in preparation).</p>
        <p>[32] N. Zingarelli, lo Zingarelli 2022, I grandi dizionari, 2022.</p>
      </sec>
      <sec id="sec-6-2">
        <title>A. Data Pipeline Software Stack</title>
        <p>The broad strokes of how we used Open Source Software
to build our data pipeline are as follows: jq for bulk
JSONL data manipulation parallelised with GNU Parallel;
NumPy, SciPy and Pandas for general data manipulation
and analysis parallelised with Modin; JupyterLab
for data exploration; topojson, Shapely and GeoPandas
for geographical data manipulation; emoji and spaCy
for textual data cleanup and tokenisation; Matplotlib
and seaborn for visualisation. All logic and glue code
is written using Python and GNU Bash. GNU Make is
used to codify an executable dependency tree between
the pipeline stages, inputs and outputs. The versions of
all stand-alone software and Python packages we used
are listed in Table 5. Indirect Python dependencies are
listed in the requirements.txt file of [24].</p>
        <p>Table 5: Software and Python packages used in our data pipeline.
jq 1.6 jqlang.github.io/jq
GNU Parallel 20230622 gnu.org/software/parallel
GNU Bash 5.1.16 gnu.org/software/bash
GNU Make 4.3 gnu.org/software/make
Python 3.10.8 python.org
NumPy 1.25.2 numpy.org
SciPy 1.11.1 scipy.org
Pandas 2.0.3 pandas.pydata.org
Modin 0.23.0 modin.readthedocs.io
JupyterLab 4.0.4 jupyterlab.readthedocs.io
topojson 1.5 mattijn.github.io/topojson
Shapely 2.0.1 shapely.readthedocs.io
GeoPandas 0.13.2 geopandas.org
spaCy 3.6.1 spacy.io
emoji 2.7.0 github.com/carpedm20/emoji
Matplotlib 3.7.2 matplotlib.org
seaborn 0.12.2 seaborn.pydata.org</p>
        <p>B. Computational Complexity</p>
        <p>A full-fledged time/space complexity analysis is beyond
the scope of this work, as it would require delving into
the implementation details of NumPy, SciPy, Pandas
and Modin. However, we can still provide some general
considerations and empirical measures on the behaviour
of the two proposed methods on a dataset with c columns
and r rows. In our case, c = 365 (days of the year) and
r ≃ 926 (token types).</p>
        <p>Calculating Spearman’s ρ for a row involves ranking
two time series and calculating their Pearson
correlation coefficient, so it is safe to assume its best-case
run-time is linear in c (and probably log-linear on average,
depending on implementation details). Applying our
method to a row involves (cumulative) sums and finding
minima/maxima, so its worst-case run-time is linear in
c. Naïve implementations using either method would
simply iterate over the rows of the dataset, so they have
linear run-time in r.</p>
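        <p>In asymptotic terms, the reasoning above can be summarised as follows
(a rough summary under the stated assumptions, with our own symbols T for
per-dataset run-time, not a formal analysis):</p>
        <preformat>
```latex
T_{\mathrm{Spearman}}(r, c) = O(r \cdot c \log c) \quad \text{(sort-based ranking; } \Omega(r \cdot c) \text{ in the best case)}
T_{\mathrm{ours}}(r, c) = \Theta(r \cdot c)
```
        </preformat>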
        <p>Given this rough time complexity analysis, we can
expect our method to have some advantage regardless of
implementation details. To quantify it, we ran a benchmark
abstracting the core computations of the two methods
and comparing their run-times for c = 365 and values
of r up to the scale of our dataset. The code is presented
in Figure 3 and the results are charted in Figure 4: we
observe that our method is more than 50 times faster on
bigger datasets.</p>
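        <p>As a concrete illustration, a minimal sketch of such a benchmark is
given below. It is not the actual code of Figure 3: the function names and
the synthetic data are ours, and it only abstracts the two per-row
computations described above (rank correlation against the day index versus
cumulative sums with extrema); measured ratios will vary by machine.</p>
        <preformat>
```python
import time

import numpy as np
from scipy.stats import spearmanr


def spearman_core(data, t):
    # Baseline: rank-correlate each row of `data` with the day index `t`.
    # Ranking sorts each series, hence roughly O(c log c) per row.
    return np.array([spearmanr(row, t)[0] for row in data])


def cumsum_core(data):
    # Alternative: cumulative sums plus per-row extrema, O(c) per row.
    cs = data.cumsum(axis=1)
    return cs.max(axis=1) - cs.min(axis=1)


if __name__ == "__main__":
    c = 365                       # days of the year
    t = np.arange(c)
    rng = np.random.default_rng(42)
    for r in (100, 1000):         # rows; scale up towards the dataset size
        data = rng.random((r, c))
        t0 = time.perf_counter()
        spearman_core(data, t)
        t1 = time.perf_counter()
        cumsum_core(data)
        t2 = time.perf_counter()
        print(f"r={r}: run-time ratio {(t1 - t0) / (t2 - t1):.1f}")
```
        </preformat>
        <p>Both cores iterate row-wise, so the measured ratio isolates the
per-row cost difference; parallelisation (e.g. with Modin) scales both sides
similarly.</p>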
        <p>The benchmark was run on a single core and expressed
only as a speedup ratio to give a sense of what to
generally expect. The implementation in Brasolin [24] is
parallelised using Modin because we could run it on a
hefty Intel Xeon E5-2690 v4 CPU with 128 GB RAM:
we traded heavy memory usage for a further speedup,
essentially making data exploration in a Jupyter notebook
not only viable but pleasant. As a result, performing a
detailed space complexity analysis is a particularly delicate
matter and one that we do not address here. However,
we should stress that our alternative method was
initially developed because our means at the outset were
much more limited (memory in particular proved to be
a bottleneck at 16 GB), and that the initial, sequential,
memory-aware implementation is still present in a
comment alongside the parallelised one for use on smaller
machines.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>C. Full List of Innovative Forms</title>
      <p>[Figure 4 residue: log-scaled x-axis “Dataset rows (r)”, 10² to 10⁵.]</p>
      <sec id="sec-7-1">
        <title>D. Comparison Charts for 𝒜 and ℬ</title>
        <p>See Figure 6.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>E. Choropleth Maps of Examples</title>
      <sec id="sec-8-1">
        <title>See Figures 7 and 8.</title>
        <p>[Figure residue; recoverable caption fragment: occurrences per
million tokens at the regional level. Total occurrences are provided with the
titles, foreign ones in the legends. We omit f4scist4 as it occurs outside of
Italy only. Panel labels: Full dataset / Subset; filters: U &gt; 9, O &gt; 9;
A &gt; 7, Z &lt; 354, Z − A &gt; 28; O &gt; 0.2. Not shown: 4 foreign; 7 foreign.]</p>
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>