DialettiBot: a Telegram Bot for Crowdsourcing Recordings of Italian Dialects

Federico Sangati, University L’Orientale, Naples, Italy, fsangati@unior.it
Ekaterina Abramova, Nijmegen University, The Netherlands, e.abramova@ftr.ru.nl
Johanna Monti, University L’Orientale, Naples, Italy, jmonti@unior.it

Abstract

English. In this paper we describe DialettiBot, a Telegram-based chatbot for crowdsourcing geo-referenced voice recordings of Italian dialects. The system enables people to listen to previously recorded audio and encourages them to contribute to building a collective linguistic resource by sending voice recordings of their own spoken dialects. The project aims at collecting a large sample of voice recordings in order to promote knowledge of linguistic variation and preserve proverbs or idioms typical of different local dialects. Moreover, the collected data can contribute to several voice-based Natural Language Processing (NLP) applications by helping them understand utterances in non-standard Italian.

Italiano. In questo articolo descriviamo DialettiBot, un chatbot basato su Telegram per raccogliere registrazioni audio geo-referenziate di dialetti italiani. Il sistema permette alle persone di ascoltare le registrazioni precedentemente inserite, e le incoraggia a contribuire alla costruzione di questa risorsa linguistica collettiva, attraverso l’invio di registrazioni audio nel proprio dialetto. Il progetto mira a raccogliere una grande mole di registrazioni che possono aiutare a promuovere la conoscenza delle variazioni linguistiche e la salvaguardia dei proverbi o modi di dire tipici di ogni dialetto locale. I dati raccolti possono inoltre contribuire a diverse applicazioni del trattamento automatico del linguaggio (TAL) che hanno bisogno di essere adattate per comprendere espressioni dialettali.

1 Introduction

It is commonly known that Italy has an abundance of different dialects, such as Florentine, Venetian, and Neapolitan. These dialects are not merely characterized by simple phonetic variation, as the term usually implies; they are proper Romance languages, with a fully developed grammar and lexicon. As Repetti puts it:

    The Italian ‘dialects’ [...] are daughter languages of Latin and sister languages of each other, of standard Italian, and of other Romance languages, and they may be as different from each other and from standard Italian as French is from Portuguese. (Repetti, 2000)

This dialectal variety is a resource that deserves to be studied and preserved for both cultural and applied reasons. The cultural motivation is that it is quickly disappearing, with fewer and fewer people regularly using dialect at home and in public places. According to the UNESCO “Atlas of the World’s Languages in Danger”,[1] there are about 2,500 endangered languages worldwide. In Italy, thirty dialects are at risk of extinction, such as friulano, ladino and veneciano.[2] The applied motivation is that in recent years we have witnessed a significant growth in the number of voice-based NLP applications (such as virtual assistants), which are currently not trained on local dialects and therefore perform poorly with a number of Italian speakers.

In this paper we present a freely available tool that enables geo-referenced recording of Italian dialects: DialettiBot, a Telegram-based chatbot whose aim is to collect a large sample of voice recordings, promoting the preservation of linguistic variation and its use in NLP applications. The rest of the paper is organized as follows: in section 2 we describe related work, in section 3 the implemented system and in section 4 the collected data.

2 Related work

There has been extensive linguistic research on Italian dialects (Lepschy and Lepschy, 1992; Belletti, 1993; D’Alessandro et al., 2010). Here we summarize a number of projects that relate to the idea of gathering linguistic recordings for producing a map of dialects. We also point out the limitations that inspired our project.

VIVALDI: The “Vivaio Acustico delle Lingue e dei Dialetti d’Italia” project is a collection of recordings and transcriptions of fixed phrases in the dialects of different cities from all regions in Italy (Kattenbusch et al., 1998). Unfortunately, the project is no longer active and has mainly focused on a finite set of chosen sentences, as opposed to spontaneous utterances.

LOCALINGUAL: A web application for crowdsourcing recordings from around the world. This project is the one that most closely relates to ours. The main differences are that it is not restricted to a specific country, does not use geo-locations, and works via a web application, which makes it difficult to use on mobile devices or with a poor data connection.[3]

ALF: Atlas Linguistique de la France, an influential dialect atlas of Romance varieties in France published in 13 volumes between 1902 and 1910 (Gilliéron and Edmont, 1902). An example of more recent work of this type is Hall (2012).[4]

ALD: Linguistic Atlas of Dolomitic Ladinian and neighbouring Dialects (Skubic, 2000). The project studies the linguistic variation between dialects of the area that covers the Grisons and Friuli regions.[5]

IDEA: The International Dialects of English Archive was created in 1998 as the internet’s first archive of primary-source recordings of English-language dialects and accents as heard around the world. With roughly 1,400 samples from 120 countries and territories, and more than 170 hours of recordings, IDEA is now the largest archive of its kind.[6]

MICROCONTACT: Aims at developing a theory of syntactic change by observing the evolution of the dialects spoken by Italians who migrated to North and South America during the 20th century.[7]

SPEAKUNIQUE and VOCALID: Two similar projects that aim at collecting English voice samples from different regions for creating personalized digital voices for text-to-speech communication devices.[8]

Our project aims to be an updated and continuously evolving initiative that can capture spontaneous (living) dialectal variation over the whole Italian territory by being freely accessible and easy to use for a variety of non-specialists. As such, the project follows methodological practices similar to other citizen-science projects (Gurevych and Zesch, 2013; Simpson et al., 2014; Hosseini et al., 2014), incorporates a GWAP[9] feature (Lafourcade et al., 2015), and fits within the line of ‘explicit crowdsourcing’ as defined by the EnetCollect[10] COST[11] action.

3 System description

In order to crowdsource recordings of Italian dialects, we have built a Telegram chatbot: DialettiBot.[12] As shown in the screenshot in figure 1, the user can interact with the bot via a standard dialogue chat interface in the Telegram application, which is freely available for all mobile and desktop operating systems.[13] Apart from textual input, the interface provides a small keyboard of buttons that changes during the dialogue flow to simplify the interaction. In addition, the bot is able to accept vocal recordings and GPS locations.

Figure 1: Screenshot of the DialettiBot system.

The bot lets the user either listen to approved recordings or add new ones.

In the listening mode, it is possible to search for recordings based on location or to view the list of the most recent recordings. As an element of gamification (Lafourcade et al., 2015), the user can ask for a random recording and try to guess its location. The user then receives feedback about the distance between the guessed location and the correct one. With this simple game we gather valuable data that would enable us to plot a type of confusability matrix between dialects, i.e., how much a dialect of place A resembles a dialect of place B.

In the recording mode, the user is asked to submit a freely chosen vocal recording of a sentence, which can be a simple phrase or a proverb typical of their dialect. In addition, the user is asked to indicate the place where the dialect comes from, either by sending a GPS location or by typing the name of the place (the latter for users who are not currently located in the place associated with the dialect), and optionally to provide the translation of the recording in Italian. As soon as the recording is submitted, the administrator of the system receives a notification (via the bot) with the new recording and is asked to approve or reject the contribution. Typical causes of rejection are too much background noise and explicitly offensive utterances. In case of approval, the recording is inserted in the database and becomes readily available to other users in the listening mode.[14]

In addition to the bot application, we developed a web application[15] (see figure 2) that displays the approved recordings on a map and lets the user click on each of them to listen to the audio and read the translation.

Figure 2: Screenshot of the web application displaying the audio map of the approved recordings.

3.1 Technical Specification

The bot is implemented in Python using the Telegram Bot API.[16] We chose to deploy the system via a chatbot (as opposed to a mobile app or web application) because it is much faster to build and to maintain, since all the major functionalities (voice recordings, GPS location) are already embedded in the chat application and immediately accessible via simple API calls. Moreover, the system works on all mobile and desktop platforms without the need to build system-specific versions. Finally, the simplified interface of a chatbot is particularly suitable for elderly people, who are one of the most valuable target groups of the project, and it can easily be used for recording other people while traveling, even with no data connection (recordings are saved locally and uploaded to the server when the data connection is available again).

The server behind DialettiBot is hosted on the Google App Engine (GAE) framework and the data is stored in the integrated datastore. GAE scales automatically with the number of users, which could enable the collection of a significantly larger volume of recordings. The same system also serves the web application with the map of the recordings illustrated in figure 2, which has been implemented in JavaScript using the Leaflet[17] library.

4 Collected data

The first version of DialettiBot was deployed in January 2016. Since then, 1,886 users have interacted with the system and have submitted 255 voice recordings, of which 220 have been approved.[18] About 14% of users who interacted with the system contributed a recording.

Figure 3 shows the bar chart with the distribution of the approved recordings over time. The plot shows that the number of contributions in 2017 (31) was considerably lower than in 2016 (117), whereas in 2018 the number is increasing again (72 in the first 3 quarters of the year).

Figure 3: Frequency of approved recordings collected over time.

Figure 4 shows the distribution of the approved recordings on the map of Italy, with the counts clustered by proximity (heat map). Campania is the region with the most recordings (38), followed by Lazio (35), Trentino-South Tyrol and Sicily (27 each), Puglia (22), Veneto (15), Piedmont and Tuscany (12 each), Calabria and Lombardy (9 each), Basilicata (5), Emilia-Romagna, Friuli-Venezia Giulia and Marches (2 each), and Abruzzo, Molise and Sardinia (1 each). Currently we have no recordings from Liguria, Umbria and Valle d’Aosta.

Figure 4: Heat map of the approved recordings.[19]

5 Conclusions and future work

We have presented DialettiBot, a chatbot system based on Telegram for crowdsourcing geo-referenced recordings of Italian dialects.

Preliminary tests show that the system can be easily used by anyone who wishes to collect data in the field as well as by the dialect speakers themselves. The recording quality is good and the data is easily exportable for further processing in the service of linguistic research or NLP applications. At the same time, the current state of the project suffers from a number of limitations that need to be addressed in future work and that we discuss next.

First, the preliminary tests have not been informed by a detailed linguistic study of dialectal variation, nor have we implemented a methodology for data collection. This is because the tests have been carried out as a proof of concept for the technology used to collect linguistic resources rather than as a full-fledged linguistic project. Future tests will require more careful consideration of dialect characteristics in the Italian language, of the type of data that would be most valuable (spontaneous speech vs. a fixed set of sentences, etc.) and of the construction of precise, reproducible instructions for the contributors.

Second, as described in section 3, we make use of a centralized validation procedure to approve a subset of recordings. However, since we do not have complete knowledge of all Italian dialects, we may end up accepting recordings which are not mapped to the correct location. In the future, we would like to decentralize the procedure by delegating the approval to a larger number of volunteers spread across all the regions, so that each new recording is validated by the closest volunteer.

Finally, the number of users and recordings collected so far is relatively modest. This is because no effort has been made so far to promote the tool among researchers or the general public. Accordingly, the current goal of the project is to get support from cultural institutions (both at a local and at a national level) to help us engage citizens in this crowdsourcing effort, as well as from academic partners to further refine the methodology and extend the chatbot capabilities.

We believe this project could help safeguard the richness of Italian dialects and collect useful resources for NLP applications, as we intend to make all recordings openly available for other researchers to use.[20]

Acknowledgments

We kindly acknowledge all users who have so far contributed to the project by providing audio recordings of their dialects, and the three anonymous reviewers for their useful feedback.

Footnotes

[1] http://www.unesco.org/languages-atlas
[2] http://www.culturaitalia.it/opencms/en/contenuti/focus/UNESCO_warns_that_thirty_Italian_dialects_are_at_risk_of_extinction.html?language=en
[3] https://localingual.com
[4] http://cartodialect.imag.fr/cartoDialect/accueil
[5] https://www.micura.it/en/activities/ald-linguistic-atlas
[6] https://www.dialectsarchive.com
[7] https://microcontact.sites.uu.nl/project
[8] https://www.speakunique.org, https://www.vocalid.co
[9] Game with a purpose.
[10] http://enetcollect.eurac.edu
[11] European Cooperation in Science and Technology.
[12] https://t.me/dialettibot
[13] https://telegram.org/apps
[14] In the future, there is a possibility to implement an additional validation step where other users or experts might flag some contribution as not being representative of a dialect.
[15] http://dialectbot.appspot.com/audiomap/mappa.html
[16] https://core.telegram.org/bots/api
[17] https://leafletjs.com
[18] As of the end of September 2018.
[19] Created via https://mapmakerapp.com.
[20] We are planning to upload the data to the Common Language Resources Infrastructure (CLARIN).

References

A. Belletti. 1993. Syntactic Theory and the Dialects of Italy. Volume 9 of Linguistica (Turin, Italy). Rosenberg & Sellier.

Roberta D’Alessandro, Adam Ledgeway, Ian Roberts, and Frank Nuessel. 2010. Syntactic Variation: The Dialects of Italy. Cambridge University Press.

Jules Gilliéron and Ed. Edmont. 1902. Atlas linguistique de la France. H. Champion, Paris.

Iryna Gurevych and Torsten Zesch. 2013. Collective intelligence and language resources: Introduction to the special issue on collaboratively constructed language resources. Lang. Resour. Eval., 47(1):1–7.

Damien Hall. 2012. Vers un nouvel atlas linguistique de la France. SHS Web of Conferences, 1:2171–2189.

Mahmood Hosseini, Keith Phalp, Jacqui Taylor, and Raian Ali. 2014. The four pillars of crowdsourcing: a reference model. IEEE Eighth International Conference on Research Challenges in Information Science.

Dieter Kattenbusch, Carola Köhler, Marcel Lucas Müller, and Fabio Tosques. 1998. VIVALDI project: Vivaio acustico delle lingue e di dialetti d’Italia. https://www2.hu-berlin.de/vivaldi.

M. Lafourcade, A. Joubert, and N.L. Brun. 2015. Games with a Purpose (GWAPS). Focus Series in Cognitive Science and Knowledge Management. Wiley.

A.L. Lepschy and G.C. Lepschy. 1992. The Italian Language Today. Hutchinson university library. Routledge.

Lori Repetti. 2000. Phonological Theory and the Dialects of Italy. John Benjamins Publishing Company.

Robert Simpson, Kevin R. Page, and David De Roure. 2014. Zooniverse: Observing the world’s largest citizen science platform. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion, WWW Companion ’14, pages 1049–1054. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland.

Mitja Skubic. 2000. Ladinia linguistica in una monumentale opera: Atlante linguistico del ladino dolomitico e dei dialetti limitrofi - ALD-1, Dr. Ludwig Reichert Verlag, Wiesbaden 1998. Linguistica, 40(1):188–195.
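
Illustrative sketch for section 3 (guessing game). The paper says that the user receives feedback about the distance between the guessed location and the correct one, and that the guesses could feed a confusability matrix between dialects. It does not say how either is computed. The sketch below assumes a great-circle (haversine) distance and a simple count table keyed by place names; the function names and the data layout are hypothetical and are not taken from the DialettiBot code.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def new_confusion_table():
    """Nested counter: table[true_place][guessed_place] -> number of guesses."""
    return defaultdict(lambda: defaultdict(int))


def record_guess(table, true_place, guessed_place):
    """Store one guess made for a recording whose real origin is true_place."""
    table[true_place][guessed_place] += 1


def confusion_rate(table, true_place, guessed_place):
    """Share of guesses for true_place that landed on guessed_place."""
    total = sum(table[true_place].values())
    return table[true_place][guessed_place] / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical round: a recording from Naples is guessed to be from Rome.
    naples, rome = (40.8518, 14.2681), (41.9028, 12.4964)
    print(f"Your guess was {haversine_km(*rome, *naples):.0f} km away.")
    table = new_confusion_table()
    record_guess(table, "Napoli", "Roma")
    record_guess(table, "Napoli", "Napoli")
    print(confusion_rate(table, "Napoli", "Roma"))
```

Keying the table by place name keeps the sketch short; a real analysis would probably bin guesses by region or by distance bands instead, since free-form guesses rarely fall on exactly the same point.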
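
Illustrative sketch for section 3 (centralized validation). The paper describes a workflow in which a submitted recording is reviewed by the administrator and, once approved, becomes available in the listening mode. The sketch below models that workflow as a small in-memory state machine. All class and field names are hypothetical, and the actual system stores its data in the GAE datastore rather than in a Python list.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Recording:
    file_id: str                        # identifier of the audio file
    place: str                          # place the dialect comes from
    latitude: float
    longitude: float
    translation: Optional[str] = None   # optional Italian translation
    status: Status = Status.PENDING
    rejection_reason: Optional[str] = None


class RecordingStore:
    """In-memory stand-in for the database of contributions."""

    def __init__(self) -> None:
        self._recordings: List[Recording] = []

    def submit(self, recording: Recording) -> Recording:
        """Store a new contribution; it stays pending until reviewed."""
        self._recordings.append(recording)
        return recording

    def review(self, recording: Recording, approve: bool,
               reason: Optional[str] = None) -> None:
        """Administrator decision; each recording is reviewed exactly once."""
        if recording.status is not Status.PENDING:
            raise ValueError("recording has already been reviewed")
        recording.status = Status.APPROVED if approve else Status.REJECTED
        recording.rejection_reason = None if approve else reason

    def approved(self) -> List[Recording]:
        """Recordings visible in the listening mode."""
        return [r for r in self._recordings if r.status is Status.APPROVED]
```

The decentralized procedure proposed in section 5 could reuse the same states and change only who is allowed to call `review`.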
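
Illustrative sketch for section 3.1 (input handling). The paper notes that voice recordings and GPS locations are available through simple Telegram Bot API calls. As a rough illustration, the function below inspects a message that has already been decoded from JSON into a Python dictionary and reports which kind of input it carries. The keys `voice`, `file_id`, `location`, `latitude`, `longitude` and `text` follow the Message object of the Telegram Bot API; the function name and the returned tuples are assumptions made for this example, not the authors' implementation.

```python
from typing import Any, Dict, Tuple


def classify_message(message: Dict[str, Any]) -> Tuple[str, Any]:
    """Return (kind, payload) for a decoded Telegram message dictionary."""
    if "voice" in message:
        # Voice notes carry a file_id that can later be used to fetch the audio.
        return ("voice", message["voice"]["file_id"])
    if "location" in message:
        loc = message["location"]
        return ("location", (loc["latitude"], loc["longitude"]))
    if "text" in message:
        # Button presses on the custom keyboard also arrive as plain text.
        return ("text", message["text"].strip())
    return ("unsupported", None)
```

A dialogue handler could then branch on the returned kind, for example storing the `file_id` in the recording mode or running a location search in the listening mode.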