Building an Italian Written-Spoken Parallel Corpus: a Pilot Study

Elisa Dominutti, Lucia Pifferi
Università di Pisa
elisa.dominutti@gmail.com, luciapiff@gmail.com

Felice Dell’Orletta, Simonetta Montemagni, Valeria Quochi
ILC–CNR
name.surnamen@ilc.cnr.it

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This paper presents a pilot study towards the creation of a monolingual written–spoken parallel corpus in Italian, featuring two main novelties in the general landscape of spoken corpora: the alignment with the written counterpart of the same content and the spoken variety dealt with, represented by transcriptions of radio news broadcasting.

1 Introduction

Nowadays, the contrast between written and spoken language no longer represents a clear-cut opposition. The emergence of modern communication technologies such as radio, television and new (digital) media has led to important changes in the analysis of diamesic variation. Under this view, the opposition between spoken and written language is reformulated in terms of a continuum, with prototypical written and spoken language at the extreme poles and, between them, a cline of intermediate linguistic varieties that mix, to different extents, features of the two. Nencioni (1976) defined the extreme poles of this continuum as the parlato-parlato (‘spoken-spoken’) variety, i.e. casual, spontaneous conversation, and the scritto-scritto (‘written-written’) variety, i.e. planned, formal, written language. Besides the typical contexts of use of spoken language (which require that all participants be present in the same environment, that the conversation be held in turns and that speakers make sure their messages are getting across), different contexts can be imagined: among them, the language of radio and television which, despite being spoken, presents traces of textual organisation recalling written language. Nencioni (1976) qualifies this variety of language use as parlato-scritto (‘spoken-written’), a label that emphasises its hybrid nature, characterised by the co-occurrence of traits typical of both written and spoken language. From a different perspective, Ong (1982) refers to this variety as ‘secondary orality’, i.e. “an orality not antecedent to writing and print, as primary orality is, but consequent and dependent upon writing and print”.

In addition to this socio-linguistic interest, the issue is also relevant for computational approaches, as it has a substantial impact on the perceived naturalness of human-machine interaction. Indeed, one of the reasons why speech synthesis applications still produce unnatural speech, apart from bad prosody, is that written language is generally not suitable, i.e. comprehensible, direct and effective, in spoken contexts (Kaji et al., 2004). With the rise and quick spread of Virtual Reality (VR) and Augmented Reality (AR) applications, moreover, the mismatch between written and spoken language styles brings about serious technological limitations, because the unnaturalness of virtual agents translates into poor human comprehension and/or distrust in those agents altogether. It is thus no longer sufficient to pass a written message to the speech synthesizer: the message needs to be transformed into a form suitable to be spoken in the specific context of use. To do this, corpus data is needed, such as a monolingual parallel aligned corpus of written and spoken texts about the same content.

A corpus designed in this way is of fundamental importance for: a) investigating the features of the parlato-scritto language variety, and its similarities and differences with respect to written language; and b) creating the prerequisites for the design and development of tools for monitoring the communicative effectiveness of texts with respect to their production mode and for supporting the semi-automatic generation or transformation of texts to be delivered orally. Such a corpus represents an important novel contribution in the area of language corpora, since corpora generally target either written or spoken language. Some corpora also include sections with transcriptions of spoken language: see for instance the Brown corpus for English. On the front of spoken corpora, large corpora of spoken Italian have been produced, some aiming at specific purposes, like CiT (Corpus di Italiano Trasmesso) (Spina, 2000) or LIR (Maraschio et al., 2004), others aiming at representing Italian in a wider perspective, like C-ORAL-ROM (Cresti and Moneglia, 2005). Some of them take into account only a few aspects of linguistic variability, mainly the diaphasic and in some cases the diamesic dimension.

Our Corpus Italiano Parallelo Parlato Scritto (‘Spoken Written Italian Parallel Corpus’, henceforth CIPPS) features two fundamental novelties in the general landscape of spoken corpora: the alignment with a written counterpart of the same content and the type of spoken variety dealt with.

2 Background and related works

Notwithstanding the differences between written and spoken language styles and the impact they have on human-machine interaction, little computational work has been devoted to developing data and methods for “transforming” a written text into a text suitable for a specific spoken context.

Previous works mostly deal with the transformation of spoken language into grammatically valid, correct written language that can be parsed by standard NLP tools: see for instance Marimuthu and Devi (2014) and Giuliani et al. (2014). However, the rise and spread of VR and AR applications call for appropriately tackling the other direction as well, i.e. the transformation of written into (diamesically) appropriate spoken language, which presents different challenges [1].

[1] VR/AR is currently a hot topic especially in both educational and industrial-training contexts (Akçayır and Akçayır, 2017; Żywicki et al., 2018; Gattullo et al., 2019; Heinz et al., 2019; Albayrak et al., 2019).

Few studies have been devoted to the automatic transformation or generation of suitable spoken language, mostly on Japanese. Among these, Murata and Isahara (2001) describe an interesting model to perform different kinds of paraphrasing tasks, that is, to transform sentences according to different predefined criteria. Interestingly, in their experiments on both sentence compression and transformation from written to spoken language they manage to apply the same algorithm to different data and obtain good results. For the latter experiment, they used a monolingual parallel corpus of academic papers and transcripts of oral presentations and built a system that learns re-writing rules according to the defined criteria. In the former case, re-writing rules were learnt from dictionaries.

Kaji and colleagues (2004; 2005) worked on the transformation of written language into spoken language style in Japanese, approaching the issue as a lexical paraphrasing problem, for which they constructed an ad-hoc written–spoken web corpus focused on the connotational differences related to the suitability of expressions for orality. Their method learns predicate paraphrases from a dictionary and then uses the corpus to statistically determine whether an expression is suitable to be spoken.

More recently, Matsubara and Hayashi (2012) report on an application for generating spontaneous news speech in a news speech delivery service. They approach the issue as a text generation task and develop a rule-based system for automatically generating news speech scripts, to be read via speech synthesis, starting from newspaper articles. Their approach, however, focuses on a specific stylistic difference peculiar to Japanese, which is hardly portable to other languages, and does not involve any kind of parallel aligned data.

3 Pilot corpus creation

In this work we describe our first attempts at building a parallel written–spoken corpus that might ultimately be useful to train a system for the transformation of written text into text suitable to be spoken. We focus on two different language varieties within the spoken-written language continuum mentioned in Section 1, namely radio spoken language and newspaper written language. This focus was dictated both by the need to neutralize the effects possibly deriving from considering different topics, textual genres and/or communication contexts, and by the practical need of finding readily available data to run the pilot. Thus the present data-set is built by aligning newspaper articles, taken as representatives of the written–written variety, and news broadcasting via radio, taken as representatives of the spoken–written variety.

3.1 Data selection and preparation

Given the goals defined above, our first step was to collect the materials for building the pilot data-set.

For the spoken data-set we chose the Lessico di italiano Radiofonico corpus (LIR) (Maraschio et al., 2004) [2], which consists of transcriptions of various Italian radio broadcast channels sampled in 1995 and 2003 and contains various types of annotations, among which: broadcaster, text genre, speaker, communication type, self-corrections, breaks, etc. In particular, we selected the transcriptions of radio news by Radio RAI1, Radio RAI2 and Radio RAI3 [3], which amount to 6 days altogether: the 23rd, 25th and 27th May 1995, and the 13th, 15th and 17th May 2003.

The written data-set was created by taking all news articles published in La Repubblica on the same dates [4]. Tables 1 and 2 report the figures of the data-sets.

[2] Source: http://www.accademiadellacrusca.it/it/attivita/lessico-frequenza-dellitaliano-radiofonico-lir
[3] The news transcriptions of the other broadcasters were too short for our purposes.
[4] Source: https://ricerca.repubblica.it/

Day          Num of news   Average length
13/05/2003   150           479
15/05/2003   144           523
17/05/2003   148           480
23/05/1995   119           578
25/05/1995   125           547
27/05/1995   124           549
Tot          810           526

Table 1: Written corpus

Day          Num of news   Average length
13/05/2003   365           60
15/05/2003   321           57
17/05/2003   156           73
23/05/1995   1184          66
25/05/1995   1106          60
27/05/1995   598           83
Tot          3730          66.5

Table 2: Spoken corpus

In the case of the spoken corpus, extensive extraction and cleaning work was required because the original transcriptions include many different genres (e.g. advertisements, interviews, entertainment, ...) and several different annotation tags.

3.2 Spoken corpus cleaning

From the selected days of the LIR corpus we needed to extract only the transcriptions of news text. The original texts contain several types of annotations, all in a proprietary tagging format, and news items are easily recognisable. So, for each day mentioned, we created a data-set by collating the news of the different radio broadcasters, thus obtaining 6 spoken data-sets, one for each day. These were subsequently cleaned by using regular expressions that removed all annotation tags, which provided us with raw text data for the alignment experiment.

Table 2 reports the number of news items extracted for each day and their average length in terms of tokens. Interestingly, but not surprisingly, we observe that newspaper articles are on average longer than radio news (526 vs. 66.5 tokens, see Tables 1 and 2).

4 Alignment methodology

Once we had gathered, cleaned and normalised the relevant data, we proceeded to align written and spoken texts on the basis of topic and semantic equivalence. Since the spoken transcriptions do not have an explicit marking of sentence boundaries, for the time being alignment is performed at text level; we leave sentence-level alignment for future work.

Given the six spoken data-sets and their corresponding written ones, we experimented with two different methods to perform their alignment. One is based on the Jaccard index (Jaccard henceforth), the other on cosine similarity (Cosine henceforth). Both algorithms followed one common preliminary step: for each data-set we took into consideration only nouns, verbs, adjectives and numerals, i.e. semantically heavy words.

The first method calculates similarity using the Jaccard index. In general, this coefficient measures the similarity of two samples as the ratio between the size of the intersection and the size of the union of the sample sets. In our case, the numerator is given by the overlap of words of the two documents, i.e. the number of relevant words present in both, while the denominator is the sum of the relevant words of both documents (rather than their union). The computation can be represented as follows:

\[
J(A, B) = \frac{|\text{overlapping words in } A, B|}{|\text{words } A| + |\text{words } B|} \tag{1}
\]

Values range between 0 (for pairs of documents that have no words in common) and 0.5 (for pairs of documents with the highest similarity, i.e. with all relevant words in common).

The second method computes the cosine similarity between a vector representing all the relevant words in a spoken text and a vector representing a written text. Each vector contains as many components as there are relevant words in the texts, the value of each component being the TF-IDF value of the corresponding word in the represented text. Once all vectors were built, we compared each spoken vector with every written vector and computed their cosine similarity. Finally, we sorted the pairs by decreasing similarity and completed the document alignment. Values for the Cosine method range between 0 and 1, with values close to 1.0 indicating strong similarity.

4.1 Alignment evaluation

The two methods illustrated above produced twelve output files, six for each method, all ranked on the basis of their similarity score in decreasing order. For each of them we considered the first one hundred spoken-written text pairs and manually evaluated their alignments on a binary scale with respect to their information content. News about the same topics, events or facts were considered good alignments. We decided to stop the evaluation at the first one hundred pairs because after this threshold the recognised alignments were no longer significant (i.e. the algorithms aligned pairs of documents with different topics).

On the 1200 manually assessed pairs we then calculated the accuracy of the two methods. We defined accuracy as the ratio between the number of correctly aligned pairs in a particular range of similarity values and the total number of pairs in the same range.

The charts in Figures 1 and 2 show method accuracy for each range of similarity values, using both the 1995 and 2003 data. For example, in the range of values between 0.1 and 0.2, the Cosine method has an accuracy of 6% with the 1995 data and 22% with the 2003 data. As we advance to the higher similarity bands, we notice a growing trend for both methods, but while for Cosine we observe a gradual growth, the Jaccard method shows a faster rise. Moreover, we notice that most of the alignments occur in the lowest similarity ranges, while in the higher similarity ranges we found very few alignments (see Tables 3 and 4 for details).

[Figure 1: Cosine accuracy]

[Figure 2: Jaccard accuracy]

                     1995             2003
Similarity range   Correct   Tot   Correct   Tot
0.8-0.7            1         1     2         2
0.7-0.6            3         3     0         0
0.6-0.5            6         6     4         4
0.5-0.4            12        13    26        30
0.4-0.3            45        55    45        61
0.3-0.2            90        206   123       167
0.2-0.1            1         16    8         36
TOT                158       300   208       300

Table 3: Cosine accuracy (1995-2003)

                     1995             2003
Similarity range   Correct   Tot   Correct   Tot
0.3-0.2            5         5     3         3
0.2-0.1            41        77    37        43
0.1-0.065          103       218   114       254
TOT                149       300   154       300

Table 4: Jaccard accuracy (1995-2003)

Bearing in mind that the range of admissible values is different for the two methods, let us focus on the results.

Cosine alignment evaluation. For both data-sets, Cosine has an accuracy of 100% in the ranges 0.8-0.7 and 0.6-0.5, while for the range 0.2-0.1 it has an accuracy of 6% on the 1995 data and 22% on the 2003 data. Figure 1 shows a gap between 0.7 and 0.6 for the 2003 data: for this data-set, the cosine method did not assign any value in this range. Overall, Cosine total accuracy is 61%: 53% on the 1995 data and 69% on the 2003 data.

Jaccard alignment evaluation. In the range 0.3-0.2 the Jaccard method has an accuracy of 100% on both data-sets. For the 1995 data it drops to 53% in the range 0.2-0.1 and to 47% in the range 0.1-0.06. For the 2003 data, the accuracy is 86% in the range 0.2-0.1 and decreases to 44.8% in the range 0.1-0.07. Also in this case, as reported in Table 4, there are few alignments in the higher similarity ranges compared with the lower ones. Overall, Jaccard total accuracy is 50%: 50% on the 1995 data and 51% on the 2003 data.

According to this evaluation, Cosine using TF-IDF values is the best method for aligning our data.

Here is an example of a text pair with high cosine similarity values (0.7-0.8):

[Spoken]: [...] il diario di Paul Mccartney [...] rottura con i Beatles è stato riconsegnato [...] al cantante il giorno dopo il concerto dei fori imperiali [...] Mccartney ha avuto modo di rileggere quel preziosissimo diario stracolmo di ricordi e ha confermato l’autenticità [...] alcune frasi portano il segno della storia "Arriva John per discutere lo scioglimento della partnership" giugno millenovecentosettanta la fine dei Beatles

[Written]: [...] il diario di Paul Mccartney [...] rottura con i Beatles è stato riconsegnato [...] al cantante, il giorno dopo il concerto dei fori imperiali. [...] sir Paul ha avuto modo di rileggere quel preziosissimo diario stracolmo di ricordi, e ha confermato l’autenticità dell’agenda. [...] alcune frasi portano il segno della storia: ‘‘arriva John per discutere lo scioglimento della partnership’’. giugno 1970, la fine dei Beatles. [...]

What follows instead is an example of a good alignment with lower cosine similarity values (0.3-0.2) [5]:

[Spoken]: se non mi attaccassero non mi difenderei [...] spiega Berlusconi [...] "Io sono un moderato" ripete il premier "Mi difendo da teoremi folli che non attaccano me ma il presidente del consiglio" [...]

[Written]: Berlusconi al contrattacco "Denuncerò chi mi offende". [...] E aggiunge che le accuse contro di lui si basano su "Teoremi folli". Teoremi ai quali [...] "Ho dato la risposta più moderata, contenuta e misurata che si potesse dare". [...]

[5] For reasons of space the example texts have been arbitrarily shortened.

The first example is also an example of high Jaccard similarity values (0.3-0.2).

In general, with both methods, the pairs of documents correctly aligned in the lower ranges of similarity show considerable differences in terms of lexical items and possibly linguistic structures, and thus represent a very interesting set of pairs for future investigation. In the higher ranges, we find a greater lexical overlap and a lower variation in linguistic structure. Comparing the pairs correctly aligned by the two methods, we counted 77 identical ones, while the number of different pairs derived from Jaccard is 220, and from Cosine 260. In total we obtained 557 different correctly aligned pairs.

5 Pilot corpus profiling

The final pilot CIPPS corpus consists of 557 text pairs corresponding to the correctly aligned and manually validated pairs of spoken and written documents resulting from both alignment methods. It can thus be taken as a gold-standard corpus of content-aligned text pairs of news for the dates and years mentioned in Section 3.1.

This section reports on our preliminary contrastive analysis of CIPPS using Monitor-IT (Montemagni, 2013), so as to establish a basic linguistic profiling of the two language varieties represented in the corpus. This analysis was done with a specific view to investigating similarities and differences in the distribution of multi-level linguistic cues (we focus here on lexical and morpho-syntactic features) both within the corpus and against prototypical written and spoken language (in the future, we plan to extend this analysis to the underlying syntactic structure).

Let us first compare the two sections of the CIPPS corpus. On the one hand, highly correlated features between the CIPPS written and spoken sections concern the distribution of nouns (both common and proper) and adjectives, as well as verbal forms used in the third person singular; the correlation was calculated with Spearman’s Correlation Coefficient (p-value ≤ 0.05). On the other hand, features that differ in a statistically significant way across the spoken and written corpus sections, detected with the Wilcoxon test (p-value ≤ 0.05), include specific verbal forms, deictic elements and determiners, prepositions and acronyms, as well as lexical richness (measured in terms of Token/Type Ratio). In particular, while verbal moods such as gerundive, subjunctive, infinitive and conditional are typically associated with written articles, the 1st and 2nd person of verbs in both singular and plural forms are typical of the spoken news reports. Demonstrative determiners and pronouns represent significant features of the spoken variety, whereas acronyms and lexical richness measured in terms of Token/Type Ratio characterise the written CIPPS section.

As for the comparison of the linguistic profiling results sketched above with what we know from the literature about features of spoken vs. written language, we observe that the widely acknowledged fact that spoken language is less complex than written language manifests itself here in quite a peculiar way. Differently from the ‘spoken-spoken’ variety, characterised by a reduced number of nouns and consequently by a lower noun/verb ratio (ranging between 0.80 and 1; Montemagni, 2013), the ‘spoken-written’ variety shares with prototypical written language a noun/verb ratio twice as high, which, according to Biber (1988), is typical of informative texts. On the other hand, it shares with prototypical spoken language the more frequent use of deictic elements, of 1st/2nd person reference in verbal forms, and of lexical repetition.

These findings, which need to be further elaborated and explored, confirm the hybrid nature of the spoken language variety represented in the CIPPS corpus, which is in line with the trend reported in the literature that the language of the radio shares features with both spontaneous oral and written language varieties.

6 Conclusions and Future work

In this paper we have presented our first experiments towards the creation of CIPPS, a monolingual written-spoken parallel aligned corpus. The data for this pilot was drawn from existing corpora and archives, automatically aligned on the basis of two statistical methods and finally manually validated. To the best of our knowledge, this is the first attempt to build such a corpus, and more research is needed to improve its potential and increase its size.

Among the open issues, the first to be addressed is the lack of punctuation in the spoken part of the corpus, which makes automatic alignment with the written counterpart too coarse. As mentioned in the introduction, a corpus like ours might also be valuable as a training set for the development of a system for transforming written texts into suitable spoken texts. Although little work has been done in this direction, the time is now ripe to tackle the challenge, and we plan to start experimenting both with paraphrasing methods, as mentioned in Section 2, and with monolingual machine translation, taking inspiration from Quirk et al. (2004) and Wubben et al. (2012). In this perspective, however, the first necessary step is to increase corpus size and improve alignment.

Acknowledgments

This work was partially supported by the 2-year project ADA, Automatic Data and documents Analysis to enhance human-based processes, funded by Regione Toscana (BANDO POR FESR 2014-2020).

References

Murat Akçayır and Gökçe Akçayır. 2017. Advantages and challenges associated with augmented reality for education: A systematic review of the literature. Educational Research Review, 20:1–11.

M. S. Albayrak, A. Öner, I. M. Atakli, and H. K. Ekenel. 2019. Personalized training in fast-food restaurants using augmented reality glasses. In 2019 International Symposium on Educational Technology (ISET), pages 129–133, July.

Douglas Biber. 1988. Variation across speech and writing. Cambridge University Press.

Emanuela Cresti and Massimo Moneglia. 2005. C-ORAL-ROM, Integrated Reference Corpora for Spoken Romance Languages. John Benjamins.

M. Gattullo, V. Dalena, A. Evangelista, A. E. Uva, M. Fiorentino, A. Boccaccio, M. Ruta, and J. L. Gabbard. 2019. A context-aware technical information manager for presentation in augmented reality. In 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pages 939–940, March.

Manuel Giuliani, Thomas Marschall, and Amy Isard. 2014. Using ellipsis detection and word similarity for transformation of spoken language into grammatically valid sentences. In Proceedings of the SIGDIAL 2014 Conference, The 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 18-20 June 2014, Philadelphia, PA, USA, pages 243–250.

Mario Heinz, Sebastian Büttner, and Carsten Röcker. 2019. Exploring training modes for industrial augmented reality learning. In Proceedings of the 12th ACM International Conference on PErvasive Technologies Related to Assistive Environments, PETRA 2019, Island of Rhodes, Greece, June 5-7, 2019, pages 398–401.

Nobuhiro Kaji and Sadao Kurohashi. 2005. Lexical choice via topic adaptation for paraphrasing written language to spoken language. In Natural Language Processing - IJCNLP 2005, Second International Joint Conference, Jeju Island, Korea, October 11-13, 2005, Proceedings, pages 981–992.

Nobuhiro Kaji, Masashi Okamoto, and Sadao Kurohashi. 2004. Paraphrasing predicates from written language to spoken language using the web. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2004, Boston, Massachusetts, USA, May 2-7, 2004, pages 241–248.

Nicoletta Maraschio, Stefania Stefanelli, Stefania Buccioni, and Marco Biffi. 2004. Dal corpus lir: prove e confronti lessicali. In Federico Albano Leoni, Francesco Cutugno, Massimo Pettorino, and Renata Savy, editors, Atti del Convegno Nazionale “Il Parlato Italiano”, page 36.

K Marimuthu and Sobha Lalitha Devi. 2014. Automatic conversion of dialectal tamil text to standard written tamil text using fsts. In Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM, Baltimore, Maryland, USA, June 27, 2014, pages 37–45.

Shigeki Matsubara and Yukiko Hayashi. 2012. Personalization of news speech delivery service based on transformation from written language to spoken language. In Toyohide Watanabe, Junzo Watada, Naohisa Takahashi, Robert J. Howlett, and Lakhmi C. Jain, editors, Intelligent Interactive Multimedia: Systems and Services, pages 449–457, Berlin, Heidelberg. Springer Berlin Heidelberg.

Simonetta Montemagni. 2013. Tecnologie linguistico-computazionali e monitoraggio della lingua italiana. Studi italiani di linguistica teorica ed applicata, (XLII(1)):145–172.

Masaki Murata and Hitoshi Isahara. 2001. Universal model for paraphrasing - using transformation based on a defined criteria. CoRR, cs.CL/0112005.

Giovanni Nencioni. 1976. Parlato-parlato, parlato-scritto, parlato-recitato. Strumenti critici, (29).

Walter J. Ong. 1982. Orality and Literacy: The Technologizing of the Word. Methuen.

Chris Quirk, Chris Brockett, and William B. Dolan. 2004. Monolingual machine translation for paraphrase generation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, EMNLP 2004, A meeting of SIGDAT, a Special Interest Group of the ACL, held in conjunction with ACL 2004, 25-26 July 2004, Barcelona, Spain, pages 142–149.

Stefania Spina. 2000. Il corpus di italiano televisivo (cit): struttura e annotazione. In Atti del Convegno SILFI.

Sander Wubben, Antal van den Bosch, and Emiel Krahmer. 2012. Sentence simplification by monolingual machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL ’12, pages 1015–1024, Stroudsburg, PA, USA. Association for Computational Linguistics.

Krzysztof Żywicki, Przemysław Zawadzki, and Filip Górski. 2018. Virtual reality production training system in the scope of intelligent factory. In Anna Burduk and Dariusz Mazurkiewicz, editors, Intelligent Systems in Production Engineering and Maintenance – ISPEM 2017, pages 450–458, Cham. Springer International Publishing.
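
As an illustration of the two alignment scores described in Section 4, the following is a minimal Python sketch, not the authors' implementation. It assumes each document has already been reduced to a list of content-word tokens (nouns, verbs, adjectives, numerals). Several details are assumptions, because the paper does not state them: the Jaccard-style score counts distinct words, the TF-IDF weights use a smoothed IDF, and document frequencies are computed over spoken and written texts together. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def paper_jaccard(words_a, words_b):
    """Eq. (1): shared relevant words over the summed sizes of the two word sets."""
    set_a, set_b = set(words_a), set(words_b)
    total = len(set_a) + len(set_b)
    return len(set_a & set_b) / total if total else 0.0


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: word -> weight) per document."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Smoothed IDF (an assumption) keeps every weight strictly positive.
            word: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return vectors


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_pairs(spoken, written, method="cosine"):
    """Score every spoken-written pair and sort by decreasing similarity."""
    if method == "cosine":
        vectors = tfidf_vectors(spoken + written)
        sp_vecs, wr_vecs = vectors[:len(spoken)], vectors[len(spoken):]
        scores = [(cosine(a, b), i, j)
                  for i, a in enumerate(sp_vecs)
                  for j, b in enumerate(wr_vecs)]
    else:
        scores = [(paper_jaccard(a, b), i, j)
                  for i, a in enumerate(spoken)
                  for j, b in enumerate(written)]
    return sorted(scores, reverse=True)


# Toy content-word lists loosely inspired by the two examples in Section 4.1.
spoken = [["diario", "mccartney", "beatles", "concerto"],
          ["berlusconi", "teoremi", "folli", "premier"]]
written = [["berlusconi", "teoremi", "folli", "accuse", "risposta"],
           ["diario", "mccartney", "beatles", "concerto", "agenda"]]

for method in ("cosine", "jaccard"):
    for score, i, j in rank_pairs(spoken, written, method)[:2]:
        print(f"{method}: spoken[{i}] ~ written[{j}] score={score:.3f}")
```

With the denominator of Eq. (1), two documents sharing all their relevant words score 0.5, which matches the upper bound stated in Section 4; the standard Jaccard index (intersection over union) would give 1 in the same situation.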
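
The accuracy figures reported in Section 4.1 for the Cosine method can be recomputed from the counts in Table 3. The short Python sketch below does this; the counts are copied from the table, while the variable and function names are hypothetical. Accuracy is taken, as in the paper, as correctly aligned pairs over all pairs in the same similarity range.

```python
# (correct, total) pairs per similarity range, copied from Table 3 (Cosine).
COSINE_1995 = {
    "0.8-0.7": (1, 1), "0.7-0.6": (3, 3), "0.6-0.5": (6, 6),
    "0.5-0.4": (12, 13), "0.4-0.3": (45, 55), "0.3-0.2": (90, 206),
    "0.2-0.1": (1, 16),
}
COSINE_2003 = {
    "0.8-0.7": (2, 2), "0.7-0.6": (0, 0), "0.6-0.5": (4, 4),
    "0.5-0.4": (26, 30), "0.4-0.3": (45, 61), "0.3-0.2": (123, 167),
    "0.2-0.1": (8, 36),
}


def band_accuracy(bands):
    """Accuracy per range; None where no pair fell into the range."""
    return {band: (correct / total if total else None)
            for band, (correct, total) in bands.items()}


def overall_accuracy(*tables):
    """Correct pairs over all evaluated pairs, pooled across the given tables."""
    correct = sum(c for table in tables for c, _ in table.values())
    total = sum(n for table in tables for _, n in table.values())
    return correct / total


for label, table in (("1995", COSINE_1995), ("2003", COSINE_2003)):
    print(label, f"overall={overall_accuracy(table):.0%}")
    for band, acc in band_accuracy(table).items():
        print(f"  {band}: {'n/a' if acc is None else format(acc, '.0%')}")
print("both years", f"{overall_accuracy(COSINE_1995, COSINE_2003):.0%}")
```

The empty 0.7-0.6 range for 2003 is reported as missing rather than as 0%, which corresponds to the gap visible in Figure 1.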