DialettiBot: a Telegram Bot
              for Crowdsourcing Recordings of Italian Dialects

   Federico Sangati                Ekaterina Abramova                    Johanna Monti
 University L’Orientale             Nijmegen University                University L’Orientale
     Naples, Italy                    The Netherlands                      Naples, Italy
fsangati@unior.it                e.abramova@ftr.ru.nl                  jmonti@unior.it


Abstract

English. In this paper we describe DialettiBot, a Telegram-based chatbot for
crowdsourcing geo-referenced voice recordings of Italian dialects. The system
enables people to listen to previously recorded audio and encourages them to
contribute to building a collective linguistic resource by sending voice
recordings of their own spoken dialects. The project aims at collecting a
large sample of voice recordings in order to promote knowledge of linguistic
variation and preserve proverbs or idioms typical of different local dialects.
Moreover, the collected data can contribute to several voice-based Natural
Language Processing (NLP) applications by helping them understand utterances
in non-standard Italian.

Italiano. In questo articolo descriviamo DialettiBot, un chatbot basato su
Telegram per raccogliere registrazioni audio geo-referenziate di dialetti
italiani. Il sistema permette alle persone di ascoltare le registrazioni
precedentemente inserite, e le incoraggia a contribuire alla costruzione di
questa risorsa linguistica collettiva, attraverso l’invio di registrazioni
audio nel proprio dialetto. Il progetto mira a raccogliere una grande mole di
registrazioni che possono aiutare a promuovere la conoscenza delle variazioni
linguistiche e la salvaguardia dei proverbi o modi di dire tipici di ogni
dialetto locale. I dati raccolti possono inoltre contribuire a diverse
applicazioni del trattamento automatico del linguaggio (TAL) che hanno bisogno
di essere adattate per comprendere espressioni dialettali.

1       Introduction

It is commonly known that Italy has an abundance of different dialects, such
as Florentine, Venetian, and Neapolitan. These dialects are not merely
characterized by simple phonetic variation, as the term usually implies; they
are proper Romance languages, with a fully developed grammar and lexicon. As
Repetti puts it:

        The Italian ‘dialects’ [...] are daughter languages of Latin and
        sister languages of each other, of standard Italian, and of other
        Romance languages, and they may be as different from each other and
        from standard Italian as French is from Portuguese. (Repetti, 2000)

This dialectal variety is a resource that deserves to be studied and preserved
for both cultural and applied reasons. The cultural motivation is that it is
quickly disappearing, with fewer and fewer people regularly using dialect at
home and in public places. According to the UNESCO “Atlas of the World’s
Languages in Danger” [1], there are about 2,500 endangered languages
worldwide. In Italy, thirty dialects are at risk of extinction, such as
friulano, ladino and veneziano [2]. The applied motivation is that in recent
years we have witnessed a significant growth in the number of voice-based NLP
applications (such as virtual assistants), which are currently not trained on
local dialects and therefore perform poorly with a number of Italian speakers.

[1] http://www.unesco.org/languages-atlas
[2] http://www.culturaitalia.it/opencms/en/contenuti/focus/UNESCO_warns_that_thirty_Italian_dialects_are_at_risk_of_extinction.html?language=en

In this paper we present a freely available tool that enables geo-referenced
recording of Italian dialects: DialettiBot, a Telegram-based chatbot, whose
aim is to collect a large sample of voice recordings, promoting preservation
of linguistic
variation and its use in NLP applications. The rest of the paper is organized
as follows: in section 2 we describe related work, in section 3 the
implemented system, and in section 4 the collected data.

2       Related work

There has been extensive linguistic research on Italian dialects (Lepschy and
Lepschy, 1992; Belletti, 1993; D’Alessandro et al., 2010). Here we summarize a
number of projects that relate to the idea of gathering linguistic recordings
for producing a map of dialects. We also point out the limitations that
inspired our project.

VIVALDI The “Vivaio Acustico delle Lingue e dei Dialetti d’Italia” project is
   a collection of recordings and transcriptions of fixed phrases in the
   dialects of different cities from all regions of Italy (Kattenbusch et al.,
   1998). Unfortunately, the project is no longer active and has mainly
   focused on a finite set of chosen sentences, as opposed to spontaneous
   utterances.

LOCALINGUAL A web application for crowdsourcing recordings from around the
   world. This project is the one that most closely relates to ours. The main
   differences are that it is not restricted to a specific country, does not
   use geo-locations and works via a web application, which makes it difficult
   to use on mobile devices or with a poor data connection. [3]

ALF Atlas Linguistique de la France: an influential dialect atlas of Romance
   varieties in France published in 13 volumes between 1902 and 1910
   (Gilliéron and Edmont, 1902). An example of more recent work of this type
   is Hall (2012). [4]

ALD Linguistic Atlas of Dolomitic Ladinian and neighbouring Dialects (Skubic,
   2000). The project studies the linguistic variation between dialects of the
   area that covers the Grisons and Friuli regions. [5]

IDEA The International Dialects of English Archive was created in 1998 as the
   internet’s first archive of primary-source recordings of English-language
   dialects and accents as heard around the world. With roughly 1,400 samples
   from 120 countries and territories, and more than 170 hours of recordings,
   IDEA is now the largest archive of its kind. [6]

MICROCONTACT aims at developing a theory of syntactic change by observing the
   evolution of the dialects spoken by Italians who migrated to North and
   South America during the 20th century. [7]

SPEAKUNIQUE and VOCALID are two similar projects that aim at collecting
   English voice samples from different regions in order to create
   personalized digital voices for text-to-speech communication devices. [8]

[3] https://localingual.com
[4] http://cartodialect.imag.fr/cartoDialect/accueil
[5] https://www.micura.it/en/activities/ald-linguistic-atlas
[6] https://www.dialectsarchive.com
[7] https://microcontact.sites.uu.nl/project
[8] https://www.speakunique.org, https://www.vocalid.co

Our project aims to be an updated and continuously evolving initiative that
can capture spontaneous (living) dialectal variation over the whole Italian
territory by being freely accessible and easy to use for a variety of
non-specialists. As such, the project follows methodological practices similar
to other citizen-science projects (Gurevych and Zesch, 2013; Simpson et al.,
2014; Hosseini et al., 2014), incorporates a GWAP [9] feature (Lafourcade et
al., 2015), and fits within the line of ‘explicit crowdsourcing’ as defined by
the EnetCollect [10] COST [11] action.

[9] Game with a purpose.
[10] http://enetcollect.eurac.edu
[11] European Cooperation in Science and Technology.

3       System description

In order to crowdsource recordings of Italian dialects, we have built a
Telegram chatbot: DialettiBot [12]. As shown in the screenshot in figure 1,
the user can interact with the bot via a standard dialogue chat interface in
the Telegram application, which is freely available for all mobile and desktop
operating systems [13]. Apart from textual input, the interface provides a
small keyboard of buttons that changes during the dialogue flow to simplify
the interaction. In addition, the bot is able to accept vocal recordings and
GPS locations.

[12] https://t.me/dialettibot
[13] https://telegram.org/apps

The bot lets the user listen to approved recordings or add new ones.

In the listening mode, it is possible to search for recordings based on
location or view the list
of the most recent recordings. As an element of gamification (Lafourcade et
al., 2015), the user can ask for a random recording and try to guess its
location. The user then receives feedback about the distance between the
guessed location and the correct one. With this simple game we gather valuable
data that would enable us to plot a type of confusability matrix between
dialects, i.e., how much the dialect of place A resembles the dialect of
place B.

Figure 1: Screenshot of the DialettiBot system.

In the recording mode, the user is asked to submit a freely chosen vocal
recording of a sentence, which can be a simple phrase or a proverb, typical of
their dialect. In addition, the user is asked to indicate the place where the
dialect comes from (either by sending a GPS location or by typing the name of
the place, in case the user is not currently located in the place associated
with the dialect), and optionally the translation of the recording into
Italian. As soon as the recording is submitted, the administrator of the
system receives a notification (via the bot) with the new recording and is
asked to approve or reject the contribution. Typical causes of rejection are
too much background noise and explicitly offensive utterances. In case of
approval, the recording is inserted in the database and becomes readily
available to other users in the listening mode [14].

[14] In the future, an additional validation step could be implemented in
     which other users or experts flag a contribution as not being
     representative of a dialect.

In addition to the bot application, we developed a web application [15] (see
figure 2) that visualizes the approved recordings on a map and lets users
click on each of them to listen to the audio and read the translation.

[15] http://dialectbot.appspot.com/audiomap/mappa.html

Figure 2: Screenshot of the web application displaying the audio map of the
approved recordings.

3.1    Technical Specification

The bot is implemented in Python using the Telegram Bot API [16]. We chose to
deploy the system via a chatbot (as opposed to a mobile app or web
application) because it is much faster to build and to maintain, since all the
major functionalities (voice recordings, GPS location) are already embedded in
the chat application and immediately accessible via simple API calls.
Moreover, the system works on all mobile and desktop platforms without the
need to build system-specific versions. Finally, the simplified interface of a
chatbot is particularly suitable for elderly people, who are one of the most
valuable target groups of the project, and it can easily be used for recording
other people while traveling, even without a data connection (recordings are
saved locally and uploaded to the server when a data connection is available
again).

[16] https://core.telegram.org/bots/api
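
To illustrate why voice recordings and GPS locations are "immediately accessible" through the chat platform, the following minimal sketch shows how an incoming update could be sorted by content type. It is not the DialettiBot source code: the function name `classify_message` and the returned labels are illustrative assumptions. Only the field names (`message`, `voice`, `file_id`, `location`, `latitude`, `longitude`, `text`) follow the public Telegram Bot API.

```python
from typing import Any, Dict, Tuple


def classify_message(update: Dict[str, Any]) -> Tuple[str, Any]:
    """Return a (kind, payload) pair for one Telegram update (hypothetical helper)."""
    message = update.get("message") or {}
    if "voice" in message:
        # A Voice object carries a file_id that can later be resolved
        # to a downloadable file with the Bot API method getFile.
        return "voice", message["voice"]["file_id"]
    if "location" in message:
        loc = message["location"]
        return "location", (loc["latitude"], loc["longitude"])
    if "text" in message:
        # Free text, e.g. a place name typed by the user or a button label.
        return "text", message["text"].strip()
    return "unsupported", None


# Example update as it might arrive from the platform (values are made up).
example = {"message": {"location": {"latitude": 40.85, "longitude": 14.27}}}
kind, payload = classify_message(example)
```

A dispatcher of this kind is enough to drive a dialogue in which the bot asks first for a recording and then for a place, since each step only needs to know which type of content the user has just sent.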

The server behind DialettiBot is hosted on the Google App Engine (GAE)
platform and the data is stored in its integrated datastore. GAE scales
automatically with the number of users, which could enable the collection of
a significantly large volume of recordings. The same system also serves the
web application with the map of the recordings illustrated in figure 2, which
has been implemented in JavaScript using the Leaflet [17] library.
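
The guessing game described in section 3 reports the distance between the guessed and the correct location, and its results can be tallied into a confusability matrix. The sketch below shows one way to compute both. It assumes locations are (latitude, longitude) pairs in degrees and uses the haversine great-circle formula on a spherical Earth; the paper does not state which distance formula the bot uses, and the function names and the sample coordinates (approximate positions of Naples and Venice) are illustrative.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = (math.radians(x) for x in a)
    lat2, lon2 = (math.radians(x) for x in b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def confusion_counts(guesses):
    """Count (true_place, guessed_place) pairs collected from the game."""
    return Counter(guesses)


# Hypothetical round: a recording from Naples is guessed to come from Venice.
naples, venice = (40.85, 14.27), (45.44, 12.32)
print(f"Your guess was {haversine_km(venice, naples):.0f} km away.")
```

Counting label pairs requires mapping each guessed coordinate to a place name or region first; averaging the haversine error per true place would be an alternative that skips that step.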

4        Collected data

The first version of DialettiBot was deployed in January 2016. Since then,
1,886 users have interacted with the system and have submitted 255 voice
recordings, of which 220 have been approved [18]. About 14% of users who
interacted with the system contributed a recording.

Figure 3 shows a bar chart with the distribution of the approved recordings
over time. The plot shows that the number of contributions in 2017 (31) was
considerably lower than in 2016 (117), whereas in 2018 the number is
increasing again (72 in the first three quarters of the year).

Figure 3: Frequency of approved recordings collected over time.

Figure 4 shows the distribution of the approved recordings on the map of
Italy, with the counts clustered by proximity (heat map). Campania is the
region with the most recordings (38), followed by Lazio (35), Trentino-South
Tyrol and Sicily (27 each), Puglia (22), Veneto (15), Piedmont and Tuscany (12
each), Calabria and Lombardy (9 each), Basilicata (5), Emilia-Romagna,
Friuli-Venezia Giulia and Marche (2 each), and Abruzzo, Molise and Sardinia (1
each). Currently we have no recordings from Liguria, Umbria and Valle d’Aosta.
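
The counts reported in this section can be cross-checked with a few lines of Python. The sketch below only re-uses the numbers stated above (the variable names are illustrative): both the yearly and the regional breakdown add up to the 220 approved recordings, provided that each count given for a group of regions applies to every region in that group.

```python
# Approved recordings per year, as reported for Figure 3.
by_year = {2016: 117, 2017: 31, 2018: 72}

# Approved recordings per region, as reported for Figure 4.
by_region = {
    "Campania": 38, "Lazio": 35, "Trentino-South Tyrol": 27, "Sicily": 27,
    "Puglia": 22, "Veneto": 15, "Piedmont": 12, "Tuscany": 12,
    "Calabria": 9, "Lombardy": 9, "Basilicata": 5, "Emilia-Romagna": 2,
    "Friuli-Venezia Giulia": 2, "Marche": 2, "Abruzzo": 1, "Molise": 1,
    "Sardinia": 1,
}

users, submitted, approved = 1886, 255, 220

total_by_year = sum(by_year.values())
total_by_region = sum(by_region.values())
approval_rate = approved / submitted    # share of submissions that were approved
submissions_per_user = submitted / users  # upper bound on the share of contributing users

print(total_by_year, total_by_region)
print(f"approved: {approval_rate:.1%}, submissions per user: {submissions_per_user:.1%}")
```

The last ratio is an upper bound on the share of contributing users because a single user may have submitted more than one recording.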

Figure 4: Heat map of the approved recordings. [19]

[17] https://leafletjs.com
[18] As of the end of September 2018.
[19] Created via https://mapmakerapp.com.

5        Conclusions and future work

We have presented DialettiBot, a chatbot system based on Telegram for
crowdsourcing geo-referenced recordings of Italian dialects.

Preliminary tests show that the system can be easily used by anyone who wishes
to collect data in the field as well as by the dialect speakers themselves.
The recording quality is good and the data is easily exportable for further
processing in the service of linguistic research or NLP applications. At the
same time, the current state of the project suffers from a number of
limitations that need to be addressed in future work and that we discuss next.

First, the preliminary tests have not been informed by a detailed linguistic
study of dialectal variation, nor have we implemented a methodology for data
collection. This is because the tests have been carried out as a proof of
concept for the technology used to collect linguistic resources rather than as
a full-fledged linguistic project. Future tests will require more careful
consideration of dialect characteristics in the Italian language, of the type
of data that would be most valuable (spontaneous speech vs. a fixed set of
sentences, etc.) and the construction of precise, reproducible instructions
for the contributors.

Second, as described in section 3, we make use of a centralized validation
procedure to approve a subset of recordings. However, since we have no
complete knowledge of all Italian dialects, we may end up accepting recordings
which are not mapped to the correct location. In the future, we would like to
decentralize the procedure by delegating the approval to a larger number of
volunteers spread out across all the regions, so that each new recording is
validated by the closest volunteer.

Finally, the number of users and recordings collected so far is relatively
modest. This is because no effort has been undertaken so far to promote its
use by researchers or the general public. Accordingly, the current goal of the
project is to get support from cultural institutions (both at a local and at a
national level) to help us engage citizens in this crowdsourcing effort, as
well as from academic partners to further refine the methodology and extend
the chatbot capabilities.

We believe this project could help safeguard the dialectal richness of Italy
and collect useful resources for NLP applications, as we intend to make all
recordings openly available for other researchers to use [20].

[20] We are planning to upload the data to the Common Language Resources and
     Technology Infrastructure (CLARIN).

Acknowledgments

We kindly acknowledge all users who have so far contributed to the project by
providing audio recordings of their dialects, and the three anonymous
reviewers for their useful feedback.

References

A. Belletti. 1993. Syntactic Theory and the Dialects of Italy. Volume 9 of
  Linguistica (Turin, Italy). Rosenberg & Sellier.

Roberta D’Alessandro, Adam Ledgeway, Ian Roberts, and Frank Nuessel. 2010.
  Syntactic Variation: The Dialects of Italy. Cambridge University Press.

Jules Gilliéron and Edmond Edmont. 1902. Atlas linguistique de la France.
  H. Champion, Paris.

Iryna Gurevych and Torsten Zesch. 2013. Collective intelligence and language
  resources: Introduction to the special issue on collaboratively constructed
  language resources. Lang. Resour. Eval., 47(1):1–7.

Damien Hall. 2012. Vers un nouvel atlas linguistique de la France. SHS Web of
  Conferences, 1:2171–2189.

Mahmood Hosseini, Keith Phalp, Jacqui Taylor, and Raian Ali. 2014. The four
  pillars of crowdsourcing: a reference model. IEEE Eighth International
  Conference on Research Challenges in Information Science.

Dieter Kattenbusch, Carola Köhler, Marcel Lucas Müller, and Fabio Tosques.
  1998. VIVALDI project: Vivaio Acustico delle Lingue e dei Dialetti d’Italia.
  https://www2.hu-berlin.de/vivaldi.

M. Lafourcade, A. Joubert, and N.L. Brun. 2015. Games with a Purpose (GWAPS).
  Focus Series in Cognitive Science and Knowledge Management. Wiley.

A.L. Lepschy and G.C. Lepschy. 1992. The Italian Language Today. Hutchinson
  University Library. Routledge.

Lori Repetti. 2000. Phonological Theory and the Dialects of Italy. John
  Benjamins Publishing Company.

Robert Simpson, Kevin R. Page, and David De Roure. 2014. Zooniverse: Observing
  the world’s largest citizen science platform. In Proceedings of the
  Companion Publication of the 23rd International Conference on World Wide
  Web Companion, WWW Companion ’14, pages 1049–1054. International World Wide
  Web Conferences Steering Committee, Republic and Canton of Geneva,
  Switzerland.

Mitja Skubic. 2000. Ladinia linguistica in una monumentale opera: Atlante
  linguistico del ladino dolomitico e dei dialetti limitrofi - ALD-1, Dr.
  Ludwig Reichert Verlag, Wiesbaden 1998. Linguistica, 40(1):188–195.