The DiDi Corpus of South Tyrolean CMC Data: A multilingual corpus of Facebook texts

Jennifer-Carmen Frey, Aivars Glaznieks, Egon W. Stemle
Institute for Specialised Communication and Multilingualism
EURAC Research, Bolzano/Bozen, Italy
{jennifer.frey,aivars.glaznieks,egon.stemle}@eurac.edu

Abstract

English. The DiDi corpus of South Tyrolean data of computer-mediated communication (CMC) is a multilingual sociolinguistic language corpus. It consists of around 600,000 tokens collected from 136 profiles of Facebook users residing in South Tyrol, Italy. In conformity with the multilingual situation of the territory, the main languages of the corpus are German and Italian (followed by English). The data has been manually anonymised and provides manually corrected part-of-speech tags for the Italian language texts and manually normalised data for the German texts. Moreover, it is annotated with user-provided socio-demographic data (among others L1, gender, age, education, and internet communication habits) from a questionnaire, and with linguistic annotations regarding CMC phenomena, languages and varieties. The anonymised corpus is freely available for research purposes.

Italiano. DiDi è un corpus di comunicazione mediata dal computer (CMC), che raccoglie dati linguistici di area sudtirolese. Il corpus, multilingue e sociolinguistico, è composto da circa 600,000 occorrenze raccolte (previo consenso all'utilizzo dei dati) dai profili di 136 iscritti a Facebook e residenti in Alto Adige. Le principali lingue del corpus, tedesco e italiano (seguite dall'inglese), riflettono lo spazio plurilingue del territorio. I dati sono stati manualmente anonimizzati e i testi in lingua italiana sono corredati da etichette (manualmente corrette) per le parti del discorso. Inoltre, DiDi è annotato con dati sociodemografici forniti dall'utente (fra gli altri: L1, genere, età, istruzione e modalità di comunicazione via Internet) attraverso un questionario e contiene ulteriori annotazioni linguistiche relative a fenomeni legati alla CMC e agli usi di varietà linguistiche. Il corpus anonimizzato è liberamente disponibile a fini di ricerca.

1 The DiDi Project

The autonomous Italian province of South Tyrol is characterized by a multilingual environment with three official languages (Italian, German, and Ladin), an institutional bi- or trilingualism (depending on the percentage of the Ladin population), and diverse individual language repertoires (Ciccolone, 2010).

In the regionally funded DiDi project,[1] the goal was to build a South Tyrolean CMC corpus to document the current language use of residents and to analyse it socio-linguistically with a focus on age. The project initially focused on the German-speaking language group. However, all information regarding the project, e.g. the invitation to participate, the privacy agreement, the project web site, and the questionnaire for socio-demographic data, was published in German and Italian. Hence, we attracted speakers of both Italian and German. Accordingly, the collected data is multilingual, with major parts in German but with a substantial portion in Italian (100,000 of 600,000 tokens).

[1] For further information see www.eurac.edu/didi.

The collected multilingual CMC corpus combines Facebook status updates, comments, and private messages with socio-demographic data (e.g. language biography, internet usage habits, and general parameters like age, gender, and level of education) of the writers. The data was enriched with linguistic annotations on thread, text and token level, including language-specific part-of-speech (PoS) and lemma information, normalisation, and language identification.

In this paper, we describe the corpus with respect to its multilingual characteristics and give special emphasis to the Italian part of the corpus, to which we added manually corrected PoS annotations. It thus presents a continuation of Frey et al. (2015), which was restricted to the German texts of the corpus and did not take into account the full variety of data collected for the total corpus.
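To make the layered annotation structure described in this section more concrete, the following minimal Python sketch shows one possible way a single annotated text could be represented. The class and field names are illustrative assumptions only and do not reflect the DiDi project's actual data model.

```python
# Hypothetical sketch of a single annotated corpus text with thread-, text- and
# token-level layers. Field names are illustrative assumptions, not the actual
# DiDi schema.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Token:
    form: str                    # token as written by the user
    norm: Optional[str] = None   # normalised form (German texts only)
    pos: Optional[str] = None    # language-specific PoS tag
    lemma: Optional[str] = None


@dataclass
class Text:
    text_id: str
    thread_id: str               # links the text to its conversation thread
    user_id: str                 # links the text to the questionnaire data
    text_type: str               # "status_update", "comment" or "chat_message"
    created_time: str            # publishing time (ISO 8601)
    language: str                # identified main language, e.g. "de", "it"
    tokens: List[Token] = field(default_factory=list)


# A minimal example record:
example = Text(
    text_id="t1", thread_id="th1", user_id="u42",
    text_type="comment", created_time="2014-05-01T12:00:00",
    language="it",
    tokens=[Token(form="ciao", pos="INTJ", lemma="ciao")],
)
```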
2 Corpus Construction

For the purpose of the DiDi project, we collected language data from social networking sites (SNS) and combined it with socio-demographic data about the writers obtained from a questionnaire. We chose to collect data from Facebook because this SNS is well known in South Tyrol, hosts a wide variety of different communication settings, and is used over the whole territory by nearly all groups of society.

Related research mainly draws on public data such as public Facebook groups, Twitter or chat data (e.g. Celli and Polonio (2013), Basile and Nissim (2013), Burghardt et al. (2016), Beißwenger (2013)), which excludes the possibility of analysing discourse patterns of non-public everyday language use.

Collecting non-public and personal data for the DiDi corpus raised technical issues regarding Italian privacy regulations (which require user consent including a privacy statement), the time-saving acquisition of authentic and complete language data, and the assignment of language data to questionnaire data. These issues were solved by developing a Facebook application[2] that allowed for the gathering of all three sorts of data (user consent, language data, questionnaire data) at once. In addition, the application was easy to share via Facebook, which helped to promote the project and to reach many potential participants. While data collection was solely managed by the Facebook application, we relied on Facebook's in-platform means (i.e. users' sharing and liking) to recruit participants. In order to reach older users (> 50 years), it was additionally necessary to resort to Facebook advertisement.[3]

With the consent of each participant, the data was downloaded via the Facebook Graph API[4] and from the questionnaire service used,[5] and stored in a local MongoDB[6] database. Both entities were linked via randomised unique identifiers. A Python interface provided access points for retrieving user and text data from the database in a linked and structured format, and also allowed us to rebuild the conversational structure of threads by linking successive text objects together. This information can now be used to analyse turn-taking and language choices within threads.[7]

[2] The source code is available at https://bitbucket.org/commul/didi_app.
[3] For details regarding the technical and strategic design of the data collection and the methods of user recruitment see Frey et al. (2014).
[4] https://developers.facebook.com/docs/graph-api
[5] http://www.objectplanet.com/opinio/
[6] https://www.mongodb.com/
[7] The source code is available at https://bitbucket.org/commul/didi_proxy.
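As an illustration of how successive text objects can be linked into threads from such a store, the following sketch retrieves texts from a MongoDB collection and groups them by thread. The collection and field names (didi, texts, thread_id, created_time, language, user_id) are assumptions for illustration and are not necessarily those used by the actual DiDi interface.

```python
# Sketch: rebuild conversation threads by linking successive text objects.
# Collection and field names are illustrative assumptions, not the actual
# schema of the DiDi database; created_time is assumed to be a datetime.
from collections import defaultdict
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
texts = client["didi"]["texts"]

# Group texts into threads, ordered by publishing time.
threads = defaultdict(list)
for doc in texts.find().sort([("thread_id", ASCENDING), ("created_time", ASCENDING)]):
    threads[doc["thread_id"]].append(doc)

# Per-thread information: used languages, number of interlocutors,
# and the time elapsed between successive texts.
for thread_id, msgs in threads.items():
    languages = {m["language"] for m in msgs}
    interlocutors = {m["user_id"] for m in msgs}
    gaps = [m2["created_time"] - m1["created_time"] for m1, m2 in zip(msgs, msgs[1:])]
    print(thread_id, languages, len(interlocutors), gaps)
```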
3 Corpus Annotations

This section describes the annotations added during the process of corpus construction.[8]

[8] See Frey et al. (2015) for detailed information on the anonymisation procedure and the normalisation and processing of German texts, including the identification of languages and varieties.

3.1 Socio-demographic Information about Participants

The corpus provides the following socio-demographic information about the participants, obtained from the online questionnaire: gender, education, employment, internet communication habits, communication devices in use, internet experience, first language(s) (L1), and usage of a South Tyrolean German or Italian dialect and its particular origin.

3.2 Linguistic Annotation of Texts

The corpus was annotated on text and token level with the following information:

• Language identification: The languages used in a text were identified in a semi-automatic approach: firstly, using the language identification tool langid.py (Lui and Baldwin, 2012), and secondly, manually correcting short texts and texts with a low confidence score (see the sketch after this list).

• Tokenisation: The corpus was tokenised with the Twitter tokenizer ark-twokenize-py[9] and subsequently corrected manually for non-standard language tokenisation issues.

• Part-of-speech tagging and lemmatisation: (Corrected) tokens were annotated with PoS tags and lemma information according to the predominant language of the text at hand. We tagged Italian texts with the Italian tag set of the Universal Dependencies project[10] using the RDR PoS Tagger (Nguyen et al., 2014). Subsequently, we manually corrected the PoS annotations to compensate for the low tagging accuracy on social media texts. Additionally, we used the TreeTagger (Schmid, 1994; Schmid, 1995) to assign PoS tags to German, English, Spanish, French and Portuguese texts, applying the standard tagsets for each language. No manual correction was performed for these languages.

• Normalisation: So far, we have manually normalised non-standard language to word-by-word standard transcriptions only for German texts.

• Variety of German: We classified German texts as dialect, non-dialect or unclassifiable by applying a heuristic approach based on the normalisation.

• Untranslatable dialect lexemes: We created a lexicon of untranslatable dialect words encountered during manual normalisation. The dialect lexicon was used to post-process out-of-vocabulary (OOV) tokens in the corpus.

• Foreign language insertions: The most common OOV tokens that we manually classified as foreign-language vocabulary were annotated with information about their language of origin.

• CMC phenomena: Emoticons, emojis, @mentions, hashtags, hyperlinks, and iterations of graphemes and punctuation marks were annotated automatically using regular expressions.

• Topic of the text: In order to investigate context factors of language choice, we annotated texts as either political or non-political according to a list of politicians, political parties and political terms.

[9] https://github.com/myleott/ark-twokenize-py
[10] http://universaldependencies.org/it/pos/index.html
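The following sketch illustrates the kind of semi-automatic language identification described in the first list item: langid.py proposes a language and a confidence score, and texts that are very short or fall below a confidence threshold are flagged for manual correction. The candidate language set, length limit and threshold are illustrative assumptions, not the settings actually used in the project.

```python
# Sketch: semi-automatic language identification with langid.py.
# Language set, length limit and confidence threshold are illustrative
# assumptions, not the project's actual settings.
from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
identifier.set_languages(["de", "it", "en", "es", "fr", "pt"])

MIN_TOKENS = 4        # very short texts go to manual correction
MIN_CONFIDENCE = 0.9  # low-confidence texts go to manual correction


def identify(text):
    """Return (language, needs_manual_check) for a CMC text."""
    lang, confidence = identifier.classify(text)
    needs_check = len(text.split()) < MIN_TOKENS or confidence < MIN_CONFIDENCE
    return lang, needs_check


print(identify("Hallo zusammen, wie geht es euch heute?"))  # e.g. ('de', False)
print(identify("ok :-)"))                                   # short text -> manual check
```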
3.3 Conversation-related Annotations

We rebuilt conversation threads by linking successive texts and created thread objects containing ordered lists of texts that are accessible via the Python interface. Thread objects contain information about the used languages and the number of active interlocutors and recipients of a message, as well as the time elapsed between two texts.

As described in Frey et al. (2015), no text content of non-participants of the DiDi project was stored, but general information about the publishing time and the language of the text was kept. If all interlocutors of a thread were participants of the project, the whole conversation is available.

3.4 User-related Annotations

In addition to the socio-demographic data, we added information about the users' (multilingual) communicative behaviour, i.e. their primary language, the languages they used, and the number of their interlocutors.

4 Corpus Data

4.1 Corpus Size

The DiDi corpus comprises public and non-public language data of 136 South Tyrolean Facebook users. The users could choose to provide their Facebook wall communication (status updates and comments), their chat communication (i.e. private messages), or both. In the end, 50 people provided access to both types of data, 80 users provided access only to their Facebook wall, and 6 users gave us only their chat communication. In total, the corpus consists of around 600 thousand tokens that are distributed over the text categories status updates (172,966 tokens), comments (94,512 tokens) and chat messages (328,796 tokens).

4.2 Multilingualism in the Corpus

The corpus is highly multilingual. Although the initial intention of the project was to document the use of German in South Tyrol, German language content comprises only 58% of the corpus. 13% of the texts are written in Italian and 4% in English (the remainder of the messages was classified as unidentifiable language, non-language or other language). The distribution of the languages reflects the language backgrounds of the participants and is comparable to the multilingual community of South Tyrol. The following tables show the distribution of profiles, texts and tokens (Table 1) and of text types (Table 2) by L1.

User L1       Profiles    Texts    Tokens
IT                   9    4,260    80,368
DE                 108   29,883   421,262
other                3      407     8,643
IT + DE             11    4,165    75,359
DE + other           5    1,110    10,642
Total              136   39,825   596,274

Table 1: Distribution of profiles, texts and tokens by L1.

User L1          SU      CO      PM
IT            1,682   1,063   1,515
DE            7,286   4,890  17,707
other           172      45     190
IT + DE       1,962     343   2,791
DE + other    1,031     166      13
Total        11,102   6,507  22,216

Table 2: Distribution of texts by text type (SU = status updates, CO = comments, PM = private messages) and by L1.

While very few users wrote only in their first language, most users used at least two (88%), very often even three (73%) or more (51%) languages. Table 3 shows the number and proportion of German, Italian and English texts written as first or second/foreign language.

Text written     as L1           as L2
IT                4,761 (57%)     3,566 (42%)
DE               23,191 (99%)       170 (1%)
EN                  166 (4%)      3,625 (96%)
All languages    28,120 (78%)     7,842 (22%)

Table 3: Distribution of text language by L1 or L2 use.

In terms of multilingual language use in the DiDi corpus, we observe a slight difference between Italian- and German-speaking users. L1 Italian speakers stick more closely to their L1 than the German-speaking participants, who are characterized by a higher usage of L2 Italian. The comparison of L1 and L2 usage in status updates, comments and private messages (cf. Table 4) shows that the respective L1 is preferred in all message types. We find the highest percentage of second or foreign language use in status updates, whereas in comments and private messages around 75% of the texts are written in the L1.

Text written       as L1            as L2
Status updates      6,774 (61%)      3,032 (27%)
Comments            5,089 (78%)        924 (14%)
Messages           16,257 (73%)      3,886 (17%)
Total              28,120 (71%)      7,842 (20%)

Table 4: Distribution of L1 and L2 use by text type.

Finally, we observed 4,295 code-switching instances on conversation level and at least 1,653 texts that contain multiple languages.[11] The average code-switching rate per user is 10%, meaning that on average every tenth text does not continue the language of the previous text in the thread (the maximum was around every second text, i.e. 42%). The average proportion of texts with multiple languages per user is 4% (max. 25%). A sketch of how such a per-user switching rate can be computed from the thread structure follows below.

[11] Texts were annotated as mixed-language texts during the correction of the language identification; therefore this annotation has not been done for the whole corpus. A further word-level identification of languages could detect even more mixed-language content (Nguyen and Dogruoz, 2013).
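The following sketch shows one possible way to compute the per-user code-switching rate reported above, i.e. the proportion of a user's texts that do not continue the language of the previous text in the thread. It builds on the thread grouping sketched in Section 2; the field names are again illustrative assumptions rather than the actual DiDi schema.

```python
# Sketch: per-user code-switching rate, i.e. the share of a user's texts whose
# language differs from the immediately preceding text in the same thread.
# Field names (thread_id, user_id, language, created_time) are assumptions.
from collections import defaultdict


def code_switching_rates(texts):
    """texts: iterable of dicts with thread_id, user_id, language, created_time."""
    threads = defaultdict(list)
    for t in texts:
        threads[t["thread_id"]].append(t)

    switches = defaultdict(int)  # texts that switch language, per user
    totals = defaultdict(int)    # texts that have a preceding text, per user
    for msgs in threads.values():
        msgs.sort(key=lambda t: t["created_time"])
        for prev, cur in zip(msgs, msgs[1:]):
            totals[cur["user_id"]] += 1
            if cur["language"] != prev["language"]:
                switches[cur["user_id"]] += 1

    return {user: switches[user] / totals[user] for user in totals}


# Example: the second user switches from German to Italian -> rate 1.0.
sample = [
    {"thread_id": 1, "user_id": "a", "language": "de", "created_time": 1},
    {"thread_id": 1, "user_id": "b", "language": "it", "created_time": 2},
]
print(code_switching_rates(sample))  # {'b': 1.0}
```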
5 Issues in Corpus Creation

In addition to the general issues of working with social media texts (e.g. text processing on noisy, short texts, as described for example in Baldwin et al. (2013) and Eisenstein (2013)), the high diversity of languages and varieties used in our corpus led to various constraints in corpus creation and processing, as cross-lingual annotation and information extraction are still crucial problems in natural language processing. We tried to address the demands of a multilingual corpus by providing language-specific PoS tagging and by applying language-independent annotations. We are aware that this is by no means sufficient to deal with linguistic research questions that exceed language boundaries. Moreover, manual correction tasks occupied a significant part of the work on the corpus, as automatic annotation (e.g. for language identification) does not yet provide the accuracy expected for linguistic studies (Carter et al., 2013; Lui and Baldwin, 2014).

6 Conclusion and Future Work

In this paper we presented a freely available language corpus of Facebook user profiles from South Tyrol, Italy. The multilingual corpus is anonymised and annotated with socio-demographic data about the users, language-specific (and, for Italian, manually corrected) PoS tags, lemmas and linguistic annotations mainly related to the used languages, varieties and multilingual phenomena. The corpus is accessible for querying via ANNIS[12] or can be obtained as processable data for research purposes at http://www.eurac.edu/didi.

[12] http://annis-tools.org/

Acknowledgements

The project was financed by the Provincia autonoma di Bolzano – Alto Adige, Ripartizione Diritto allo studio, università e ricerca scientifica, Legge provinciale 13 dicembre 2006, n. 14 "Ricerca e innovazione".
References

Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How noisy social media text, how diffrnt social media sources. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 356–364.

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107.

Michael Beißwenger. 2013. Das Dortmunder Chat-Korpus. Zeitschrift für germanistische Linguistik, 41(1):161–164.

Manuel Burghardt, Daniel Granvogl, and Christian Wolff. 2016. Creating a Lexicon of Bavarian Dialect by Means of Facebook Language Data and Crowdsourcing. In Proceedings of LREC 2016, pages 2029–2033.

Simon Carter, Wouter Weerkamp, and Manos Tsagkias. 2013. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 47(1):195–215.

Fabio Celli and Luca Polonio. 2013. Relationships between personality and interactions in Facebook. Social Networking: Recent Trends, Emerging Issues and Future Outlook, pages 41–54.

Simone Ciccolone. 2010. Lo standard tedesco in Alto Adige. Il segno e le lettere. LED Edizioni Universitarie, Milan.

Jacob Eisenstein. 2013. What to do about bad language on the internet. In Proceedings of NAACL-HLT, pages 359–369.

Jennifer-Carmen Frey, Egon W. Stemle, and Aivars Glaznieks. 2014. Collecting language data of non-public social media profiles. In Gertrud Faaß and Josef Ruppenhofer, editors, Workshop Proceedings of the 12th Edition of the KONVENS Conference, pages 11–15, Hildesheim, Germany, October. Universitätsverlag Hildesheim.

Jennifer-Carmen Frey, Egon W. Stemle, and Aivars Glaznieks. 2015. The DiDi Corpus of South Tyrolean CMC Data. In Workshop Proceedings of the 2nd Workshop on NLP4CMC at GSCL 2015.

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25–30. Association for Computational Linguistics.

Marco Lui and Timothy Baldwin. 2014. Accurate language identification of Twitter messages. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM) @ EACL, pages 17–25, Gothenburg. Association for Computational Linguistics.

Dong-Phuong Nguyen and A. Seza Dogruoz. 2013. Word level language identification in online multilingual communication. Association for Computational Linguistics.

Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham, and Son Bao Pham. 2014. RDRPOSTagger: A ripple down rules-based part-of-speech tagger. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 17–20.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, volume 12, pages 44–49.

Helmut Schmid. 1995. Improvements in part-of-speech tagging with an application to German. In Proceedings of the ACL SIGDAT-Workshop.