=Paper= {{Paper |id=Vol-2481/paper45 |storemode=property |title=KIParla Corpus: A New Resource for Spoken Italian |pdfUrl=https://ceur-ws.org/Vol-2481/paper45.pdf |volume=Vol-2481 |authors=Caterina Mauri,Silvia Ballarè,Eugenio Goria,Massimo Cerruti,Francesco Suriano |dblpUrl=https://dblp.org/rec/conf/clic-it/MauriBGCS19 }} ==KIParla Corpus: A New Resource for Spoken Italian== https://ceur-ws.org/Vol-2481/paper45.pdf
KIParla Corpus: A New Resource for Spoken Italian¹

Caterina Mauri, Università di Bologna, caterina.mauri@unibo.it
Silvia Ballarè, Università di Torino, silvia.ballare@unito.it
Eugenio Goria, Università di Torino, eugenio.goria@unito.it
Massimo Cerruti, Università di Torino, massimosimone.cerruti@unito.it
Francesco Suriano, Università di Bologna, francesco.suriano2@studio.unibo.it

Abstract

In this paper we introduce the main features of the KIParla corpus, a new resource for the study of spoken Italian. In addition to its other capabilities, KIParla provides access to a wide range of metadata that characterize both the participants and the settings in which the interactions take place. Furthermore, it is designed to be shared as a free resource tool through the NoSketch Engine interface and to be expanded as a monitor corpus (Sinclair 1991).

1     KIParla corpus: an introduction

The aim of this paper is to describe the design and implementation of a new resource tool for the study of spoken Italian. The KIParla corpus is the result of a joint collaboration between the Universities of Bologna and Turin and is open to further partnerships in the future.
   It is characterized by a number of innovative features. In addition to providing access to a wide range of metadata concerning the speakers and the setting in which the interactions take place, it offers transcriptions time-aligned with audio files and is designed to be expanded and upgraded through the addition of independent modules, constructed with a similar attention to the metadata; moreover, it is completely open-access and makes use of open-access technologies, such as the NoSketch Engine platform.
   Section 2 provides a detailed description of the corpus design, aimed at featuring the geographic, social and situational variation that characterizes spoken Italian. In Section 3 we discuss corpus implementation, describing how data have been collected in adherence with ethical requirements, how they have been treated and transcribed, and how they have been made accessible and searchable through NoSketch Engine. Section 4 focuses on the incremental modularity of the corpus, which makes it an open monitor corpus of spoken Italian. The two modules that constitute the current core of KIParla, namely KIP and ParlaTO, are then briefly illustrated, and some prospects for future developments are outlined.

2     Corpus design

This section discusses the parameters taken into account for the creation of the KIParla corpus. In particular, we stress the relevance of extralinguistic factors (regarding both the socio-geographic profile/status of the speakers and the interactional contexts) in order to build a corpus suitable for investigating (socio)linguistic variation in contemporary Italian.

2.1    Aims

The KIParla corpus is designed to overcome some of the shortcomings that characterize previous resources used in the study of spoken Italian. It is intended to bring about major improvements concerning three key aspects of corpus-based research: (i) access to the speakers’ metadata, particularly to those concerning age and social group; (ii) the possibility to browse the corpus online as well as to download specific recordings; (iii) text-to-speech alignment.

¹ Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

As for (i), the possibility to recover information about the speakers or about the situation in which a conversational exchange has occurred is central in several fields of linguistics, such as sociolinguistics and conversation analysis, and is potentially relevant in many others, such as second language acquisition and language teaching. While some corpora provide general information about the setting of the interaction, at present there is no other corpus of spoken Italian that offers detailed information about single speakers. As for (ii), KIParla will be accessible online through the NoSketch Engine interface, and on the project website it will be possible to download all the recordings (in .wav or .mp3 format) and transcriptions, as previously done for CLIPS (Albano Leoni 2007), VoLIP (Voghera et al. 2014), and other corpora. Moreover, with regard to (iii) the research platform will enable users to listen to the results of single queries and download them in .mp3 format, offering text-to-speech alignment.
   The philosophy behind KIParla is to pave the way for a collection of spoken corpora, each compiled according to a shared methodology in order to facilitate comparability. For this reason, it was designed as an open resource that is able to receive further implementations from external contributors who want to share their data; therefore, it can also be thought of as a monitor corpus (Sinclair 1991) which grows in size over time thanks to an increasingly wide range of materials.

2.2    The geographic dimension: collecting data in different cities with speakers from all over Italy

The diatopic dimension has always been considered to be of greatest significance when describing the Italian sociolinguistic scenario (see Berruto 2012 inter al.); in fact, speech utterances without any regional features are seldom if ever found even among educated speakers and in formal situations. Currently, the only spoken corpora that take into account geographic variation are the LIP corpus and the CLIPS corpus. In the KIParla corpus, thus far we have collected data in Turin and Bologna; the sociolinguistic situation in both urban settings is characterized by the coexistence of Italian and the local dialect, as well as the resulting development of intermediate varieties. Furthermore, even with significant differences, both cities have been and are destinations of internal mobility, and thus we are likely to find several varieties of Italian from other parts of Italy, as well as Italo-Romance dialects. One good example of such a scenario is provided in (1); the conversation, recorded in Turin, has two speakers using the progressive periphrasis stare + a + infinitive combined with the apocopated form of the lexical verb, which are two typical features of regional varieties of Italian spoken in central Italy.

   (1) GF_TO091: ho capito ma tu sei entrata troppo nella parte stai a fa’ l’attrice
       “I see but you are getting too much into this, you’re putting on an act”

       BC_TO089: sì
       “yes”

       SF_TO090: no non sto a fa’ l’attrice io parlo così normalmente come potete notare ragazze
       “no, I’m not putting on an act. This is the way I usually speak, as you can see girls”

                                   (KIP corpus, TOA3012)

   In order to have a deeper understanding of the situation, information regarding both the city in which the data were collected and the place of origin of each speaker can be retrieved.

2.3    The diastratic dimension: a perspective on Italian society

The speakers involved in the recordings are distinguished primarily by their age and level of education; these are traditionally deemed to be the most relevant social factors for the analysis of sociolinguistic variation in Italian (see Berretta 1988). Part of the KIParla corpus (see KIP module in §4.1) is focused on educated speakers, i.e. undergraduates, graduate students, and university professors. In the second data collection sample (see ParlaTO module in §4.2), far more social factors have been taken into account, and both the age range and the level of education of the informants have been broadened. Ideally, the incremental nature of the corpus will make it possible to explore the various dimensions of variation in depth.

2.4    Types of interaction: settings and activities

Building on a central assumption in the conversation analytic framework, i.e. that linguistic practices are often related to specific social activities, we dedicated particular attention to including different types of situations, expecting to find considerable differences between the structures involved in each.
   In order to narrow down the field of analysis, for the first bulk of the KIParla corpus we chose to consider various types of interaction occurring in a single sociolinguistic domain (Fishman 1972), namely the academic context.
   The different activities were thus classified according to the following external factors: (i) the symmetrical vs asymmetrical relationship between the participants; (ii) the presence vs absence of previously established topics; (iii) the presence vs absence of constraints on turn-taking. We believe, indeed, that using these three very general features is particularly helpful in the task of integrating new data recorded in other situations, without losing comparability with the other parts of the corpus. For example, interviews collected with different types of speakers in the ParlaTO section (§ 4.2) will be comparable to those collected in the academic setting, regardless of any other difference between the two sets.

3     Building the corpus: data collection, transcription, publication, and accessibility

3.1    Data collection: praxis and ethics

All data have been collected by professional researchers; students and interns of the Universities of Bologna and Turin have also been involved in the process, but only after a period of specific training. Increasing the number of data collectors is crucial to avoid unwanted bias caused by the inclusion of informants that belong to the same social network. Furthermore, they acted as second-order contacts (see friend of a friend in Tagliamonte 2006: 21-22) and thus played an intermediary role in recording spontaneous speech and interviews.
   Whenever data were being collected, speakers were first informed of the main aims of the project and the reasons why we needed to record the interaction. They agreed to the recording and signed a consent form that complies with the European Union’s General Data Protection Regulation (G.D.P.R.). The consent form allowed us to collect linguistic material for scientific purposes, to store it in hardware located in Europe and/or via cloud services provided by universities, and to make it available online.
   All the collected data are transcribed (see § 3.2) and anonymized before being made available to the public. The voice of the speakers is the only sensitive data that remains directly accessible.

3.2    Transcription: challenges and solutions

All the recordings have been transcribed by professional researchers and trained students or interns using ELAN software (Sloetjes and Wittenburg 2008). This tool is designed specifically to handle multi-level annotations relating to different speakers in a conversation. It also makes it possible to link each annotation to the media timeline. Thanks to this feature of the software, it was possible to implement text-to-speech alignment within the NoSketch Engine interface (§3.3).
   Every tier in the transcription refers to an alphanumeric code that links the spoken production of a single speaker to his/her metadata (e.g. age and level of education); similarly, each transcription file is associated with a code that allows its metadata to be traced (e.g. type of activity, number of participants, time and place of collection).
   The most challenging aspect of transcribing spoken data is to strike a balance between a faithful representation of oral production and the “searchability” of the written texts. For this reason, we decided to adopt a simplified version of the Jefferson (2004) conventions used in conversation analysis (see Figure 1). An example of this transcription convention is provided in Figure 2.

 ,              Rising intonation
 .              Falling intonation
 :              Prolonged sound (each : corresponds to ca. 20ms)
 (.)            Short pause
 >hello<        Bracketed speech is delivered more rapidly
 <hello>        Bracketed speech is delivered more slowly
 [hello]        Overlap between participants
 (hello)        Hardly intelligible speech (transcriber’s best guess)
 xxx            Unintelligible speech
 ((laughs))     Non-verbal behavior
 =              Prosodically attached units
   Figure 1: Symbols used in the transcription based on Jefferson (2004)

   Figure 2: Conversational transcription as shown in the corpus page
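
The link between tiers, speaker codes, and the media timeline described in § 3.2 can be illustrated with a minimal sketch that reads an ELAN-style XML document using Python's standard library. It is not the project's own script. The sample document, the speaker codes SPK_A and SPK_B, the placeholder metadata values, and the assumption that the speaker code is stored in the TIER_ID attribute are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical speaker table: codes and values are placeholders, not KIParla data.
SPEAKERS = {
    "SPK_A": {"age": "placeholder-a", "education": "placeholder-a"},
    "SPK_B": {"age": "placeholder-b", "education": "placeholder-b"},
}

# Hand-written sample in the ELAN (.eaf) XML layout: a shared list of time
# slots, then one tier per speaker whose annotations point at two time slots.
EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1200"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="1300"/>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="1900"/>
  </TIME_ORDER>
  <TIER TIER_ID="SPK_B">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>sì</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="SPK_A">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>ho capito</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_turns(eaf_text, speakers):
    """Return time-ordered turns, each joined with its speaker's metadata."""
    root = ET.fromstring(eaf_text)
    # Map each time slot id to its offset in milliseconds.
    times = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
    }
    turns = []
    for tier in root.iter("TIER"):
        code = tier.get("TIER_ID")  # assumption: tier id == speaker code
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            turns.append({
                "speaker": code,
                "start_ms": times[ann.get("TIME_SLOT_REF1")],
                "end_ms": times[ann.get("TIME_SLOT_REF2")],
                "text": ann.findtext("ANNOTATION_VALUE", default=""),
                "metadata": speakers.get(code, {}),
            })
    turns.sort(key=lambda turn: turn["start_ms"])
    return turns


TURNS = read_turns(EAF_SAMPLE, SPEAKERS)
for turn in TURNS:
    print(turn["speaker"], turn["start_ms"], turn["end_ms"], turn["text"])
```

Keeping the metadata in a separate table keyed by the speaker code means the transcription files themselves carry only the code, which fits the anonymization requirement described in § 3.1.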
   The decision to implement conversational transcription was mainly due to the fact that it enables us to obtain a sufficient level of precision, without forcing the researcher to make interpretive choices. This is crucial in the handling of both performance-related phenomena occurring in spoken language (e.g. reformulations and truncated words) and non-standard variants.
   However, as will be explained in the next section, we decided to make the data searchable based on the simple orthographic transcription, while the conversational transcript is accessible as an additional option.
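
To make the relation between the two transcription layers concrete, the following minimal sketch derives a plain orthographic string from a line annotated with the symbols in Figure 1. It is only an illustration under stated assumptions, not the procedure actually used for KIParla: the function name `to_orthographic`, the sample line, and the choices to drop unintelligible stretches (xxx) and to split units joined by = are hypothetical.

```python
import re


def to_orthographic(jefferson: str) -> str:
    """Strip the Figure 1 symbols, keeping only the words that were spoken."""
    text = re.sub(r"\(\(.*?\)\)", " ", jefferson)  # ((laughs)): non-verbal behavior
    text = text.replace("(.)", " ")                # (.): short pause
    text = re.sub(r"[\[\]()<>]", "", text)         # overlap, doubt, tempo brackets
    text = text.replace("=", " ")                  # prosodically attached units
    text = re.sub(r"[:,.]", "", text)              # prolongation and intonation marks
    text = re.sub(r"\bxxx\b", " ", text)           # unintelligible speech
    return " ".join(text.split())                  # collapse leftover whitespace


SAMPLE = "[ciao:] (.) >come stai<, ((laughs)) xxx (bene)."
print(to_orthographic(SAMPLE))
```

The short pause is removed before the generic bracket and punctuation rules so that its parentheses and full stop are not processed separately. Storing both the original string and the stripped one for every annotation would allow searches on either layer.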
3.3    Data publication: From ELAN to NoSketch Engine

The transcriptions obtained through ELAN are in XML format and are automatically time-aligned to the speech audio files; thus, they are ready to be treated and parsed by XML-compatible technologies. Since one of our aims was to make the corpus fully accessible, we decided to make data available through the NoSketch Engine interface (Rychlý 2007).
   NoSketch Engine is an open-source tool for corpus management which provides a powerful and user-friendly interface to perform corpus searches, generate word/keyword lists, retrieve collocations based on several statistical measures, and much more. In order to adapt the XML output of ELAN to the format required by NoSketch Engine, we wrote a Python script that allows the user to: (i) make the metadata available both as query filters and text information; (ii) search the orthographic and Jefferson transcriptions; (iii) directly link every occurrence with the time-aligned portion of the media file associated with it; (iv) search each module of the corpus separately.
   Users can perform a query either by browsing the whole corpus or by selecting one or more metadata concerning the participants or the conversation in which they appear. Figure 3 shows how the metadata can be selected in the corpus. As reported in Figures 4 and 5 respectively, with regard to the KIP module (§ 4.1) conversation metadata include the type of conversation, the city in which it was recorded and the year, the number of participants, and the relationship between them; the participants’ metadata include occupation, gender, age, and the region of origin. During data collection, the participants indicated both the city of birth and the city in which they attended high school; however, we decided to retain only the latter information as an indicator of the speakers’ region of origin.

   Figure 3: Metadata selection

 Type of conversation                 Spontaneous conversation, Exams, Interviews, Lessons, Office hours
 City                                 Bologna, Turin
 Number of participants               1, 2, 3, 4, 5, 6
 Year                                 2017/18, 2019
 Relation between the participants    Asymmetrical, Symmetrical
   Figure 4: Conversation metadata

   Figures 6 and 7 provide an example of a query in the NoSketch Engine interface; the results appear in KWIC (Keyword-In-Context) format, in which each token is presented within a string of characters containing the words that precede and follow it. By clicking on the conversation name reported in blue in the left portion of the screen, users can access the conversation's metadata, a full transcription of the file, both in Jefferson and text-only format, and a link to the corresponding
audio file (see Figure 6). By clicking on the token, in red, users can open a text box which provides further context (see Figure 7).

 Occupation     Professor, Student
 Gender         Male, Female
 Region         Abruzzo, Basilicata, Calabria, ...
 Age bracket    Under 25, 26-30, 31-35, 36-40, 41-45, 46-50, 51-55, 56-60, Over 60
   Figure 5: Participants’ metadata

   Figure 6: Conversation metadata

   Figure 7: Context

   As of September 2019, the corpus can be accessed online at the website www.kiparla.it. At present, it only consists of the KIP module (see 4.1), but further modules are already being processed and will be uploaded to the same website (see below). The corpus has not yet been lemmatized or POS-tagged, but such steps are planned for the near future.

4    Incremental modularity: an accessible open monitor corpus of spoken Italian

A key feature that makes the KIParla corpus particularly innovative is its incremental modularity, namely its division into independent modules and the ability to add new modules over time.
   Modules contain different corpora of spoken Italian sharing the same design and a common set of metadata (see §2) which have been transcribed with ELAN and made available through NoSketch Engine by running the same script (see §3). The modules may focus on different dimensions of linguistic variation and may collect data from different geographical areas. However, the shared procedure of data collection and treatment guarantees a high level of mutual comparability.
   Easy access to all of the metadata makes the corpus expandable, through the addition of further modules focusing on different geographical, socio-cultural, or communicative aspects, and upgradable, through the addition of new data to existing modules. Such a dynamic nature of the KIParla corpus makes it a potential monitor corpus, open to additions and upgrades over time. In the following sections, we provide a brief description of the two modules which at present constitute the core of the KIParla corpus.

4.1   KIP module

The KIP subcorpus is the first section that was designed within KIParla and was originally conceived as a self-sufficient unit. It consists of approximately 70 hours of recorded speech collected in Turin and Bologna (35 hours per city approximately) and transcribed between 2016 and 2019.
   The subcorpus is domain-specific in that it includes various types of interactions occurring within the academic setting; moreover, from a sociolinguistic perspective, it only includes speakers whose achievements pertain to higher education, namely university students and professors. The social characteristics of the speakers are clearly reflected in speech data, e.g. in the highly educated use of the relative clause in example (2).

   (2) LB_BO100: abbiamo una struttura di dati, abbiamo un algoritmo attraverso il quale ci muoviamo tra queste strutture di dati
       “we have a data structure, we have an algorithm through which we move among these data structures.”

                                   (KIP corpus, BOD1007)

   The structure of this subcorpus is intended to maximize diaphasic variability, according to the parameters described in 2.4 (symmetrical vs asymmetrical relations; presence vs absence of a
moderator; presence vs absence of a fixed topic). This resulted in the selection of the contexts listed in Figure 8, which represent ideal combinations between such parameters.

 Activity                    Bologna       Turin
 spontaneous conversation    10:00:37      06:22:24
 exams                       03:09:34      03:10:48
 lessons                     12:19:39      13:25:33
 interviews                  06:18:37      07:47:38
 office hours                02:59:11      03:49:08
 TOTAL                       34:47:38      34:35:30
   Figure 8: Hours recorded for each interaction type in Turin and Bologna

   The complete KIP module is currently available on the www.kiparla.it website.

4.2    ParlaTO module

ParlaTO is a corpus of spontaneous speech collected in Turin between 2018 and 2019. The corpus is being compiled in an effort to portray a contemporary multilingual urban setting. In fact, Turin has been, and still is, the scene of contact between different languages, partly because of the endogenous coexistence of Italian and Piedmontese, and partly as the result of both internal and external migration patterns.
   Basically, the corpus contains speech data coming from three categories of individuals: (i) speakers of Piedmontese origin, (ii) speakers from other parts of Italy, and (iii) speakers of foreign origin, i.e. first and second-generation immigrants. Accordingly, the collection of data accounts for different languages and language varieties, namely Italian – either as L1 or L2 – and, to a lesser extent, immigrant minority languages and Piedmontese, as well as other Italo-Romance dialects. Therefore, the corpus makes it possible to investigate a wide range of phenomena. Below are just a couple of examples of Italian as L1: a case of substratum interference in (3), i.e. the absence of a preverbal negative marker (which characterizes most Northern Italo-Romance dialects), and a typical feature of uneducated speech in (4), i.e. the use of ci as 3pl indirect object clitic pronoun.

   (3) PST035: in quei tempi q- c’era proprio niente da mangiare
       “in those days there was really nothing to eat”

                                   (ParlaTO corpus, PTB009)

   (4) PMM017: c’erano gli altri ragazzi ci ho fatto dei nomi
       “the other boys were there, I gave them some names”

                                   (ParlaTO corpus, PTB002)

   Data have been collected through semi-structured interviews about city life and personal experiences (urban initiatives, policies for neighborhoods, leisure time activities, etc.). The corpus provides a rich set of metadata, geared to fostering the investigation of linguistic variation across socio-economic classes and social groups. It includes such categories as age, level of education, gender, employment status, place of birth (of both the individual and their parents), mother tongue, and knowledge of other languages, as well as duration of stay and duration of study in Italy for first and second-generation immigrants. The occurrence of Italo-Romance dialects and/or foreign languages in speech utterances is being tagged as well.
   ParlaTO is thus meant to fill some crucial gaps in the panorama of Italian speech corpora. In particular, the spontaneous speech of such social groups as young speakers with limited educational qualifications and first and second-generation immigrants can, for the first time, be the subject of targeted corpus-based searches online.
   The corpus currently amounts to approximately 60 hours of speech, one third of which is from speakers of foreign origin. However, ParlaTO is still under construction and will not be available online until early 2020.

5     Conclusions and future prospects

The ParlaTO corpus has been added to the KIP corpus, thereby creating two modules within the larger KIParla corpus. We aim to make this resource grow over time through subsequent additions and upgrades. The leading idea is that the greater the variety of interactions, speakers, and geographical areas recorded in the KIParla data, the more the corpus will become representative of the language(s) and language varieties spoken in
Italy. Moreover, as the corpus is upgraded over time, it will tell us more and more about the sociolinguistic situation in the Italian peninsula.
   We envision the future development of the corpus to proceed in two main directions. On the one hand, we intend to collaborate with existing projects, in order to verify whether data already collected for different purposes may be adapted into new modules of the KIParla corpus. The only requirement in such cases is the ability to trace and access a core set of metadata for the speakers (gender, age, geographical information, level of education, and occupation) and for the interaction (interview, free conversation, etc.). Further metadata would of course be welcome. Moreover, new data collection efforts have already started or are scheduled to start in different regions (e.g. in Lombardy). A data collection project parallel to ParlaTO is also planned for Bologna.
   The second direction along which KIParla will grow has to do with data annotation. For the moment, KIParla data are available as prosodic and orthographic transcriptions, time-aligned with the speech audio file and linked to the metadata of speakers and interactions. Further functions are offered by NoSketch Engine, such as word sketches, thesaurus, and keyword computation.
   We plan two further stages of annotation, namely lemmatization and POS-tagging, which will significantly enhance data retrieval. Due to space constraints, we are unable to discuss the problems that lemmatization and POS-tagging raise when applied to spoken data (cf. Panunzi, Picchi, Moneglia 2004), and leave such a crucial discussion to future work.

References

Albano Leoni, Federico (2007), “Un frammento di storia recente della ricerca (linguistica) italiana. Il corpus CLIPS”. In: Bollettino d’Italianistica, IV, (2), 122-130.

Berretta, Monica (1988), “Italienisch: Varietätenlinguistik des Italienischen/Linguistica delle varietà”. In: Lexikon der Romanistischen Linguistik, vol. IV, 762-774.

Berruto, Gaetano (2012), Sociolinguistica dell’italiano contemporaneo. Seconda edizione, Roma, Carocci.

De Mauro, Tullio, Federico Mancini, Massimo Vedovelli and Miriam Voghera (1993), Lessico di frequenza dell’italiano parlato, Milano, Etaslibri.

Fishman, Joshua (1972), “Domains and the relationship between micro- and macrosociolinguistics”. In: Gumperz, John and Dell Hymes (eds.), Directions in sociolinguistics. The ethnography of communication, New York, Holt, Rinehart and Winston, 435-453.

Jefferson, Gail (2004), “Glossary of transcript symbols with an introduction”. In: Lerner, Gene H. (ed.), Conversation Analysis: studies from the first generation, Amsterdam, John Benjamins, 13-31.

Panunzi, Alessandro, Eugenio Picchi and Massimo Moneglia (2004), “Using PiTagger for Lemmatization and PoS Tagging of a Spontaneous Speech Corpus: C-Oral-Rom Italian”. In: Proceedings of Fourth Language Resources and Evaluation Conference (LREC 2004).

Rychlý, Pavel (2007), “Manatee/Bonito – A Modular Corpus Manager”. In: 1st Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Masaryk University, 65-70.

Sinclair, John (1991), Corpus, Concordance, Collocation, Oxford, Oxford University Press.

Tagliamonte, Sali A. (2006), Analysing sociolinguistic variation, Cambridge, Cambridge University Press.

Voghera, Miriam, Claudio Iacobini, Renata Savy, Francesco Cutugno, Aurelio De Rosa and Iolanda Alfano (2014), “VoLIP: A searchable Italian spoken corpus”. In: Veselovská, Ludmila and Markéta Janebová (eds.), Complex visibles out there. Proceedings of the Olomouc Linguistics Colloquium: Language use and linguistic structure, Olomouc, Palacký University, 628-640.