=Paper=
{{Paper
|id=Vol-2481/paper45
|storemode=property
|title=KIParla Corpus: A New Resource for Spoken Italian
|pdfUrl=https://ceur-ws.org/Vol-2481/paper45.pdf
|volume=Vol-2481
|authors=Caterina Mauri,Silvia Ballarè,Eugenio Goria,Massimo Cerruti,Francesco Suriano
|dblpUrl=https://dblp.org/rec/conf/clic-it/MauriBGCS19
}}
==KIParla Corpus: A New Resource for Spoken Italian==
Caterina Mauri (Università di Bologna, caterina.mauri@unibo.it), Silvia Ballarè (Università di Torino, silvia.ballare@unito.it), Eugenio Goria (Università di Torino, eugenio.goria@unito.it), Massimo Cerruti (Università di Torino, massimosimone.cerruti@unito.it), Francesco Suriano (Università di Bologna, francesco.suriano2@studio.unibo.it)

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===

In this paper we introduce the main features of the KIParla corpus, a new resource for the study of spoken Italian. In addition to its other capabilities, KIParla provides access to a wide range of metadata that characterize both the participants and the settings in which the interactions take place. Furthermore, it is designed to be shared as a free resource tool through the NoSketch Engine interface and to be expanded as a monitor corpus (Sinclair 1991).

===1 KIParla corpus: an introduction===

The aim of this paper is to describe the design and implementation of a new resource tool for the study of spoken Italian. The KIParla corpus is the result of a joint collaboration between the Universities of Bologna and Turin and is open to further partnerships in the future.

It is characterized by a number of innovative features. In addition to providing access to a wide range of metadata concerning the speakers and the setting in which the interactions take place, it offers transcriptions time-aligned with audio files and is designed to be expanded and upgraded through the addition of independent modules, constructed with a similar attention to the metadata; moreover, it is completely open-access and makes use of open-access technologies, such as the NoSketch Engine platform.

Section 2 provides a detailed description of the corpus design, aimed at featuring the geographic, social and situational variation that characterizes spoken Italian. In Section 3 we discuss corpus implementation, describing how data have been collected in adherence with ethical requirements, how they have been treated and transcribed, and how they have been made accessible and searchable through NoSketch Engine. Section 4 focuses on the incremental modularity of the corpus, which makes it an open monitor corpus of spoken Italian. The two modules that constitute the current core of KIParla, namely KIP and ParlaTO, are then briefly illustrated, and some prospects for future developments are outlined.

===2 Corpus design===

This section discusses the parameters taken into account for the creation of the KIParla corpus. In particular, we stress the relevance of extralinguistic factors (regarding both the socio-geographic profile/status of the speakers and the interactional contexts) in order to build a corpus suitable for investigating (socio)linguistic variation in contemporary Italian.

====2.1 Aims====

The KIParla corpus is designed to overcome some of the shortcomings that characterize previous resources used in the study of spoken Italian. It is intended to bring about major improvements concerning three key aspects of corpus-based research: (i) access to the speakers’ metadata, particularly to those concerning age and social group; (ii) the possibility to browse the corpus online as well as to download specific recordings; (iii) text-to-speech alignment.

As for (i), the possibility to recover information about the speakers or about the situation in which a conversational exchange has occurred is central in several fields of linguistics, such as sociolinguistics and conversation analysis, and is potentially relevant in many others, such as second language acquisition and language teaching. While some corpora provide general information about the setting of the interaction, at present there is no other corpus of spoken Italian that offers detailed information about single speakers. As for (ii), KIParla will be accessible online through the NoSketch Engine interface, and on the project website it will be possible to download all the recordings (in .wav or .mp3 format) and transcriptions, as previously done for CLIPS (Albano Leoni 2007), VoLIP (Voghera et al. 2014), and other corpora. Moreover, with regard to (iii), the research platform will enable users to listen to the results of single queries and download them in .mp3 format, offering text-to-speech alignment.

The philosophy behind KIParla is to pave the way for a collection of spoken corpora, each compiled according to a shared methodology in order to facilitate comparability. For this reason, it was designed as an open resource that is able to receive further implementations from external contributors who want to share their data; therefore, it can also be thought of as a monitor corpus (Sinclair 1991) which grows in size over time thanks to an increasingly wide range of materials.

====2.2 The geographic dimension: collecting data in different cities with speakers from all over Italy====

The diatopic dimension has always been considered to be of greatest significance when describing the Italian sociolinguistic scenario (see Berruto 2012 inter al.); in fact, speech utterances without any regional features are seldom if ever found even among educated speakers and in formal situations. Currently, the only spoken corpora that take into account geographic variation are the LIP corpus and the CLIPS corpus. In the KIParla corpus, thus far we have collected data in Turin and Bologna; the sociolinguistic situation in both urban settings is characterized by the coexistence of Italian and the local dialect, as well as the resulting development of intermediate varieties. Furthermore, even with significant differences, both cities have been and are destinations of internal mobility, and thus we are likely to find several varieties of Italian from other parts of Italy, as well as Italo-Romance dialects. One good example of such a scenario is provided in (1); the conversation, recorded in Turin, has two speakers using the progressive periphrasis stare + a + infinitive combined with the apocopated form of the lexical verb, which are two typical features of regional varieties of Italian spoken in central Italy.

(1) GF_TO091: ho capito ma tu sei entrata troppo nella parte stai a fa’ l’attrice

“I see but you are getting too much into this, you’re putting on an act”

BC_TO089: sì

“yes”

SF_TO090: no non sto a fa’ l’attrice io parlo così normalmente come potete notare ragazze

“no, I’m not putting on an act. This is the way I usually speak, as you can see girls”

(KIP corpus, TOA3012)

In order to have a deeper understanding of the situation, information regarding both the city in which the data were collected and the place of origin of each speaker can be retrieved.

====2.3 The diastratic dimension: a perspective on Italian society====

The speakers involved in the recordings are distinguished primarily by their age and level of education; these are traditionally deemed to be the most relevant social factors for the analysis of sociolinguistic variation in Italian (see Berretta 1988). Part of the KIParla corpus (see KIP module in §4.1) is focused on educated speakers, i.e. undergraduates, graduate students, and university professors. In the second data collection sample (see ParlaTO module in §4.2), far more social factors have been taken into account, and both the age range and the level of education of the informants have been broadened. Ideally, the incremental nature of the corpus will make it possible to explore the various dimensions of variation in depth.

====2.4 Types of interaction: settings and activities====

Building on a central assumption in the conversation analytic framework, i.e. that linguistic practices are often related to specific social activities, we dedicated particular attention to including different types of situations, expecting to find considerable differences between the structures involved in each.

In order to narrow down the field of analysis, for the first bulk of the KIParla corpus we chose to consider various types of interaction occurring in a single sociolinguistic domain (Fishman 1972), namely the academic context.

The different activities were thus classified according to the following external factors: (i) the symmetrical vs asymmetrical relationship between the participants; (ii) the presence vs absence of previously established topics; (iii) the presence vs absence of constraints on turn-taking. We believe, indeed, that using these three very general features is particularly helpful in the task of integrating new data recorded in other situations, without losing comparability with the other parts of the corpus. For example, interviews collected with different types of speakers in the ParlaTO section (§4.2) will be comparable to those collected in the academic setting, regardless of any other difference between the two sets.

===3 Building the corpus: data collection, transcription, publication, and accessibility===

====3.1 Data collection: praxis and ethics====

All data have been collected by professional researchers; students and interns of the Universities of Bologna and Turin have also been involved in the process, but only after a period of specific training. Increasing the number of data collectors is crucial to avoid unwanted bias caused by the inclusion of informants that belong to the same social network. Furthermore, they acted as second-order contacts (see friend of a friend in Tagliamonte 2006: 21-22) and thus played an intermediary role in recording spontaneous speech and interviews.

Whenever data were being collected, speakers were first informed of the main aims of the project and the reasons why we needed to record the interaction. They agreed to the recording and signed a consent form that complies with the European Union’s General Data Protection Regulation (GDPR). The consent form allowed us to collect linguistic material for scientific purposes, to store it in hardware located in Europe and/or via cloud services provided by universities, and to make it available online.

All the collected data are transcribed (see §3.2) and anonymized before being made available to the public. The voice of the speakers is the only sensitive information that remains directly accessible.

====3.2 Transcription: challenges and solutions====

All the recordings have been transcribed by professional researchers and trained students or interns using ELAN software (Sloetjes and Wittenburg 2008). This tool is designed specifically to handle multi-level annotations relating to different speakers in a conversation. It also makes it possible to link each annotation to the media timeline. Thanks to this feature of the software, it was possible to implement text-to-speech alignment within the NoSketch Engine interface (§3.3).

Every tier in the transcription refers to an alphanumeric code that links the spoken production of a single speaker to his/her metadata (e.g. age and level of education); similarly, each transcription file is associated with a code that allows its metadata to be traced (e.g. type of activity, number of participants, time and place of collection).

The most challenging aspect of transcribing spoken data is to strike a balance between a faithful representation of oral production and the “searchability” of the written texts. For this reason, we decided to adopt a simplified version of the Jefferson (2004) conventions used in conversation analysis (see Figure 1). An example of this transcription convention is provided in Figure 2.

{| class="wikitable"
! Symbol !! Meaning
|-
| , || Rising intonation
|-
| . || Falling intonation
|-
| : || Prolonged sound (each : corresponds to ca. 20ms)
|-
| (.) || Short pause
|-
| >hello< || Bracketed speech is delivered more rapidly
|-
| <hello> || Bracketed speech is delivered more slowly
|-
| [hello] || Overlap between participants
|-
| (hello) || Hardly intelligible speech (transcriber’s best guess)
|-
| xxx || Unintelligible speech
|-
| ((laughs)) || Non-verbal behavior
|-
| = || Prosodically attached units
|}

Figure 1: Symbols used in the transcription based on Jefferson (2004)

Figure 2: Conversational transcription as shown in the corpus page

The decision to implement conversational transcription was mainly due to the fact that it enables us to obtain a sufficient level of precision, without forcing the researcher to make interpretive choices. This is crucial in the handling of both performance-related phenomena occurring in spoken language (e.g. reformulations and truncated words) and non-standard variants.

However, as will be explained in the next section, we decided to make the data searchable based on the simple orthographic transcription, while the conversational transcript is accessible as an additional option.

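The relation between time-aligned speaker tiers, the conversational transcript, and the searchable orthographic text can be illustrated with a minimal sketch. It is not the script used for KIParla: the sample document is a hand-written fragment in the style of ELAN's XML output, the tier IDs and utterances are invented, and the stripping rules in the hypothetical function `to_orthographic` are only one plausible reading of the symbols in Figure 1 (for instance, `=` is replaced by a space and `xxx` is kept, both of which are assumptions).

```python
import re
import xml.etree.ElementTree as ET

# Hand-written, minimal ELAN-style fragment. Tier IDs and utterances are
# invented for illustration and do not come from the KIParla corpus.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1800"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="2400"/>
  </TIME_ORDER>
  <TIER TIER_ID="SPK_A">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>allo:ra (.) ci vediamo [domani],</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="SPK_B">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>[va bene] ((ride)) &gt;a domani&lt;.</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def to_orthographic(jefferson: str) -> str:
    """Strip the Figure 1 symbols, keeping only the words (illustrative rules)."""
    text = re.sub(r"\(\([^)]*\)\)", " ", jefferson)  # ((laughs)): non-verbal behavior
    text = re.sub(r"\(\.\)", " ", text)              # (.): short pause
    text = text.replace("=", " ")                    # =: attached units (assumed: split)
    text = re.sub(r"[\[\]()<>:]", "", text)          # overlap, guess, tempo, lengthening
    text = re.sub(r"[,.]", "", text)                 # intonation marks
    return " ".join(text.split())                    # collapse leftover whitespace


def read_tiers(eaf_xml: str) -> list:
    """Return one record per annotation: speaker code, time span, both transcripts."""
    root = ET.fromstring(eaf_xml)
    # Assumes every time slot is aligned, i.e. carries a TIME_VALUE in milliseconds.
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT")}
    rows = []
    for tier in root.iter("TIER"):
        speaker = tier.get("TIER_ID")  # the code that would link to speaker metadata
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            value = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append({
                "speaker": speaker,
                "start_ms": slots[ann.get("TIME_SLOT_REF1")],
                "end_ms": slots[ann.get("TIME_SLOT_REF2")],
                "jefferson": value,
                "orthographic": to_orthographic(value),
            })
    return sorted(rows, key=lambda row: row["start_ms"])


if __name__ == "__main__":
    for row in read_tiers(SAMPLE_EAF):
        print(row["start_ms"], row["end_ms"], row["speaker"], row["orthographic"], sep="\t")
```

Keeping both strings in the same record mirrors the design choice described above: queries can run on the orthographic form, while the conversational transcript and the time span remain attached to each unit.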
====3.3 Data publication: From ELAN to NoSketch Engine====

The transcriptions obtained through ELAN are in XML format and are automatically time-aligned to the speech audio files; thus, they are ready to be treated and parsed by XML-compatible technologies. Since one of our aims was to make the corpus fully accessible, we decided to make data available through the NoSketch Engine interface (Rychlý 2007).

NoSketch Engine is an open-source tool for corpus management which provides a powerful and user-friendly interface to perform corpus searches, generate word/keyword lists, retrieve collocations based on several statistical measures, and much more. In order to adapt the XML output of ELAN to the format required by NoSketch Engine, we wrote a Python script that allows the user to: (i) make the metadata available both as query filters and text information; (ii) search the orthographic and Jefferson transcriptions; (iii) directly link every occurrence with the time-aligned portion of the media file associated with it; (iv) search each module of the corpus separately.

Users can perform a query either by browsing the whole corpus or by selecting one or more metadata concerning the participants or the conversation in which they appear. Figure 3 shows how the metadata can be selected in the corpus.

Figure 3: Metadata selection

As reported in Figures 4 and 5 respectively, with regard to the KIP module (§4.1) conversation metadata include the type of conversation, the city in which it was recorded and the year, the number of participants, and the relationship between them; the participants’ metadata include occupation, gender, age, and the region of origin. During data collection, the participants indicated both the city of birth and the city in which they attended high school; however, we decided to retain only the latter information as an indicator of the speakers’ region of origin.

{| class="wikitable"
! Metadata !! Values
|-
| Type of conversation || Spontaneous conversation, Exams, Interviews, Lessons, Office hours
|-
| City || Bologna, Turin
|-
| Number of participants || 1, 2, 3, 4, 5, 6
|-
| Year || 2017/18, 2019
|-
| Relation between the participants || Asymmetrical, Symmetrical
|}

Figure 4: Conversation metadata

{| class="wikitable"
! Metadata !! Values
|-
| Occupation || Professor, Student
|-
| Gender || Male, Female
|-
| Region || Abruzzo, Basilicata, Calabria, ...
|-
| Age bracket || Under 25, 26-30, 31-35, 36-40, 41-45, 46-50, 51-55, 56-60, Over 60
|}

Figure 5: Participants’ metadata

Figures 6 and 7 provide an example of a query in the NoSketch Engine interface; the results appear in KWIC (Keyword-In-Context) format, in which each token is presented within a string of characters containing the words that precede and follow it. By clicking on the conversation name reported in blue in the left portion of the screen, users can access the conversation's metadata, a full transcription of the file, both in Jefferson and text-only format, and a link to the corresponding audio file (see Figure 6). By clicking on the token, in red, users can open a text box which provides further context (see Figure 7).

Figure 6: Conversation metadata

Figure 7: Context

As of September 2019, the corpus can be accessed online at the website www.kiparla.it. At present, it only consists of the KIP module (see §4.1), but further modules are already being processed and will be uploaded to the same website (see below). The corpus has not yet been lemmatized or POS-tagged, but such steps are planned for the near future.

===4 Incremental modularity: an accessible open monitor corpus of spoken Italian===

A key feature that makes the KIParla corpus particularly innovative is its incremental modularity, namely its division into independent modules and the ability to add new modules over time.

Modules contain different corpora of spoken Italian sharing the same design and a common set of metadata (see §2), which have been transcribed with ELAN and made available through NoSketch Engine by running the same script (see §3). The modules may focus on different dimensions of linguistic variation and may collect data from different geographical areas. However, the shared procedure of data collection and treatment guarantees a high level of mutual comparability.

Easy access to all of the metadata makes the corpus expandable, through the addition of further modules focusing on different geographical, socio-cultural, or communicative aspects, and upgradable, through the addition of new data to existing modules. Such a dynamic nature of the KIParla corpus makes it a potential monitor corpus, open to additions and upgrades over time. In the following sections, we provide a brief description of the two modules which at present constitute the core of the KIParla corpus.

====4.1 KIP module====

The KIP subcorpus is the first section that was designed within KIParla and was originally conceived as a self-sufficient unit. It consists of approximately 70 hours of recorded speech collected in Turin and Bologna (approximately 35 hours per city) and transcribed between 2016 and 2019. The subcorpus is domain-specific in that it includes various types of interactions occurring within the academic setting; moreover, from a sociolinguistic perspective, it only includes speakers whose achievements pertain to higher education, namely university students and professors. The social characteristics of the speakers are clearly reflected in speech data, e.g. in the highly educated use of the relative clause in example (2).

(2) LB_BO100: abbiamo una struttura di dati, abbiamo un algoritmo attraverso il quale ci muoviamo tra queste strutture di dati

“we have a data structure, we have an algorithm through which we move among these data structures.”

(KIP corpus, BOD1007)

The structure of this subcorpus is intended to maximize diaphasic variability, according to the parameters described in §2.4 (symmetrical vs asymmetrical relations; presence vs absence of a moderator; presence vs absence of a fixed topic). This resulted in the selection of the contexts listed in Figure 8, which represent ideal combinations between such parameters.

{| class="wikitable"
! Activity !! Bologna !! Turin
|-
| spontaneous conversation || 10:00:37 || 06:22:24
|-
| exams || 03:09:34 || 03:10:48
|-
| lessons || 12:19:39 || 13:25:33
|-
| interviews || 06:18:37 || 07:47:38
|-
| office hours || 02:59:11 || 03:49:08
|-
| TOTAL || 34:47:38 || 34:35:30
|}

Figure 8: Hours recorded for each interaction type in Turin and Bologna

The complete KIP module is currently available on the www.kiparla.it website.

====4.2 ParlaTO module====

ParlaTO is a corpus of spontaneous speech collected in Turin between 2018 and 2019. The corpus is being compiled in an effort to portray a contemporary multilingual urban setting. In fact, Turin has been, and still is, the scene of contact between different languages, partly because of the endogenous coexistence of Italian and Piedmontese, and partly as the result of both internal and external migration patterns.

Basically, the corpus contains speech data coming from three categories of individuals: (i) speakers of Piedmontese origin, (ii) speakers from other parts of Italy, and (iii) speakers of foreign origin, i.e. first and second-generation immigrants. Accordingly, the collection of data accounts for different languages and language varieties, namely Italian (either as L1 or L2) and, to a lesser extent, immigrant minority languages and Piedmontese, as well as other Italo-Romance dialects. Therefore, the corpus makes it possible to investigate a wide range of phenomena. Below are just a couple of examples of Italian as L1: a case of substratum interference in (3), i.e. the absence of a preverbal negative marker (which characterizes most Northern Italo-Romance dialects), and a typical feature of uneducated speech in (4), i.e. the use of ci as 3pl indirect object clitic pronoun.

(3) PST035: in quei tempi q- c’era proprio niente da mangiare

“in those days there was really nothing to eat”

(ParlaTO corpus, PTB009)

(4) PMM017: c’erano gli altri ragazzi ci ho fatto dei nomi

“the other boys were there, I gave them some names”

(ParlaTO corpus, PTB002)

Data have been collected through semi-structured interviews about city life and personal experiences (urban initiatives, policies for neighborhoods, leisure time activities, etc.). The corpus provides a rich set of metadata, geared to fostering the investigation of linguistic variation across socio-economic classes and social groups. It includes such categories as age, level of education, gender, employment status, place of birth (of both the individual and their parents), mother tongue, and knowledge of other languages, as well as duration of stay and duration of study in Italy for first and second-generation immigrants. The occurrence of Italo-Romance dialects and/or foreign languages in speech utterances is being tagged as well.

ParlaTO is thus meant to fill some crucial gaps in the panorama of Italian speech corpora. In particular, the spontaneous speech of such social groups as young speakers with limited educational qualifications and first and second-generation immigrants can, for the first time, be the subject of targeted corpus-based searches online.

The corpus currently amounts to approximately 60 hours of speech, one third of which is from speakers of foreign origin. However, ParlaTO is still under construction and will not be available online until early 2020.

===5 Conclusions and future prospects===

The ParlaTO corpus has been added to the KIP corpus, thereby creating two modules within the larger KIParla corpus. We aim to make this resource grow over time through subsequent additions and upgrades. The leading idea is that the greater the variety of interactions, speakers, and geographical areas recorded in the KIParla data, the more the corpus will become representative of the language(s) and language varieties spoken in Italy. Moreover, as the corpus is upgraded over time, it will tell us more and more about the sociolinguistic situation in the Italian peninsula.

We envision the future development of the corpus to proceed in two main directions. On the one hand, we intend to collaborate with existing projects, in order to verify whether data already collected for different purposes may be adapted into new modules of the KIParla corpus. The only requirement in such cases is the ability to trace and access a core set of metadata for the speakers (gender, age, geographical information, level of education, and occupation) and for the interaction (interview, free conversation, etc.). Further metadata would of course be welcome. Moreover, new data collection efforts have already started or are scheduled to start in different regions (e.g. in Lombardy). A data collection project parallel to ParlaTO is also planned for Bologna.

The second direction along which KIParla will grow has to do with data annotation. For the moment, KIParla data are available as prosodic and orthographic transcriptions, time-aligned with the speech audio file and linked to the metadata of speakers and interactions. Further functions are offered by NoSketch Engine, such as word sketches, thesaurus, and keyword computation. We plan two further stages of annotation, namely lemmatization and POS-tagging, which will significantly enhance data retrieval. Due to space constraints, we are unable to discuss the problems that lemmatization and POS-tagging raise when applied to spoken data (cf. Panunzi, Picchi, Moneglia 2004), and leave such a crucial discussion to future work.

===References===

Albano Leoni, Federico (2007), “Un frammento di storia recente della ricerca (linguistica) italiana. Il corpus CLIPS”. In: Bollettino d’Italianistica, IV, (2), 122-130.

Berretta, Monica (1988), “Italienisch: Varietätenlinguistik des Italienischen/Linguistica delle varietà”. In: Lexikon der Romanistischen Linguistik, vol. IV, 762-774.

Berruto, Gaetano (2012), Sociolinguistica dell’italiano contemporaneo. Seconda edizione, Roma, Carocci.

De Mauro, Tullio, Federico Mancini, Massimo Vedovelli and Miriam Voghera (1993), Lessico di frequenza dell’italiano parlato, Milano, Etaslibri.

Fishman, Joshua (1972), “Domains and the relationship between micro- and macrosociolinguistics”. In: Gumperz, John and Dell Hymes (eds.), Directions in sociolinguistics. The ethnography of communication, New York, Holt, Rinehart and Winston, 435-453.

Jefferson, Gail (2004), “Glossary of transcript symbols with an introduction”. In: Lerner, Gene H. (ed.), Conversation Analysis: studies from the first generation, Amsterdam, John Benjamins, 13-31.

Panunzi, Alessandro, Eugenio Picchi and Massimo Moneglia (2004), “Using PiTagger for Lemmatization and PoS Tagging of a Spontaneous Speech Corpus: C-Oral-Rom Italian”. In: Proceedings of the Fourth Language Resources and Evaluation Conference (LREC 2004).

Rychlý, Pavel (2007), “Manatee/Bonito – A Modular Corpus Manager”. In: 1st Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Masaryk University, 65-70.

Sinclair, John (1991), Corpus, Concordance, Collocation, Oxford, Oxford University Press.

Tagliamonte, Sali A. (2006), Analysing sociolinguistic variation, Cambridge, Cambridge University Press.

Voghera, Miriam, Claudio Iacobini, Renata Savy, Francesco Cutugno, Aurelio De Rosa and Iolanda Alfano (2014), “VoLIP: A searchable Italian spoken corpus”. In: Veselovská, Ludmila and Markéta Janebová (eds.), Complex visibles out there. Proceedings of the Olomouc Linguistics Colloquium: Language use and linguistic structure, Olomouc, Palacký University, 628-640.