The DiDi Corpus of South Tyrolean CMC Data: A multilingual corpus of Facebook texts

Jennifer-Carmen Frey, Aivars Glaznieks, Egon W. Stemle
Institute for Specialised Communication and Multilingualism
EURAC Research, Bolzano/Bozen, Italy
{jennifer.frey,aivars.glaznieks,egon.stemle}@eurac.edu

Abstract

English. The DiDi corpus of South Tyrolean data of computer-mediated communication (CMC) is a multilingual sociolinguistic language corpus. It consists of around 600,000 tokens collected from 136 profiles of Facebook users residing in South Tyrol, Italy. In conformity with the multilingual situation of the territory, the main languages of the corpus are German and Italian (followed by English). The data has been manually anonymised and provides manually corrected part-of-speech tags for the Italian language texts and manually normalised data for the German texts. Moreover, it is annotated with user-provided socio-demographic data (among others L1, gender, age, education, and internet communication habits) from a questionnaire, and with linguistic annotations regarding CMC phenomena, languages and varieties. The anonymised corpus is freely available for research purposes.

Italiano. DiDi è un corpus di comunicazione mediata dal computer (CMC), che raccoglie dati linguistici di area sudtirolese. Il corpus, multilingue e sociolinguistico, è composto da circa 600,000 occorrenze raccolte (previo consenso all'utilizzo dei dati) dai profili di 136 iscritti a Facebook e residenti in Alto Adige. Le principali lingue del corpus, tedesco e italiano (seguite dall'inglese), riflettono lo spazio plurilingue del territorio. I dati sono stati manualmente anonimizzati e i testi in lingua italiana sono corredati da etichette (manualmente corrette) per le parti del discorso. Inoltre, DiDi è annotato con dati sociodemografici forniti dall'utente (fra gli altri: L1, genere, età, istruzione e modalità di comunicazione via Internet) attraverso un questionario e contiene ulteriori annotazioni linguistiche relative a fenomeni legati alla CMC e agli usi di varietà linguistiche. Il corpus anonimizzato è liberamente disponibile a fini di ricerca.

1 The DiDi Project

The autonomous Italian province of South Tyrol is characterized by a multilingual environment with three official languages (Italian, German, and Ladin), an institutional bi- or trilingualism (depending on the percentage of the Ladin population), and diverse individual language repertoires (Ciccolone, 2010).

In the regionally funded DiDi project,[1] the goal was to build a South Tyrolean CMC corpus to document the current language use of residents and to analyse it socio-linguistically with a focus on age. The project initially focused on the German-speaking language group. However, all information regarding the project, e.g. the invitation to participate, the privacy agreement, the project web site, and the questionnaire for socio-demographic data, was published in German and Italian. Hence, we attracted speakers of both Italian and German. Accordingly, the collected data is multilingual, with major parts in German but with a substantial portion in Italian (100,000 of 600,000 tokens).

[1] For further information see www.eurac.edu/didi.

The collected multilingual CMC corpus combines Facebook status updates, comments, and private messages with socio-demographic data (e.g. language biography, internet usage habits, and general parameters like age, gender, and level of education) of the writers. The data was enriched with linguistic annotations on thread, text and token level, including language-specific part-of-speech (PoS) and lemma information, normalisation, and language identification.

In this paper, we describe the corpus with respect to its multilingual characteristics and give special emphasis to the Italian part of the corpus, to which we added manually corrected PoS annotations. It thus presents a continuation of Frey et al. (2015), which was restricted to the German texts of the corpus and did not take into account the full variety of data collected for the total corpus.
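To make the layered annotation structure described in this section more concrete, the following minimal Python sketch shows one possible way a single annotated text could be represented. The class and field names are illustrative assumptions only and do not reflect the DiDi project's actual data model.

```python
# Hypothetical sketch of a single annotated corpus text with thread-, text- and
# token-level layers. Field names are illustrative assumptions, not the actual
# DiDi schema.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Token:
    form: str                    # token as written by the user
    norm: Optional[str] = None   # normalised form (German texts only)
    pos: Optional[str] = None    # language-specific PoS tag
    lemma: Optional[str] = None


@dataclass
class Text:
    text_id: str
    thread_id: str               # links the text to its conversation thread
    user_id: str                 # links the text to the questionnaire data
    text_type: str               # "status_update", "comment" or "chat_message"
    created_time: str            # publishing time (ISO 8601)
    language: str                # identified main language, e.g. "de", "it"
    tokens: List[Token] = field(default_factory=list)


# A minimal example record:
example = Text(
    text_id="t1", thread_id="th1", user_id="u42",
    text_type="comment", created_time="2014-05-01T12:00:00",
    language="it",
    tokens=[Token(form="ciao", pos="INTJ", lemma="ciao")],
)
```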
2 Corpus Construction

For the purpose of the DiDi project, we collected language data from social networking sites (SNS) and combined it with socio-demographic data about the writers obtained from a questionnaire. We chose to collect data from Facebook because this SNS is well known in South Tyrol, hosts a wide variety of different communication settings, and is used over the whole territory by nearly all groups of society.

Related research mainly draws on public data such as public Facebook groups, Twitter or chat data (e.g. Celli and Polonio (2013), Basile and Nissim (2013), Burghardt et al. (2016), Beißwenger (2013)), which excludes the possibility of analysing discourse patterns of non-public everyday language use.

Collecting non-public and personal data for the DiDi corpus raised technical issues regarding Italian privacy regulations (which require user consent including a privacy statement), the time-saving acquisition of authentic and complete language data, and the assignment of language data to questionnaire data. These issues were solved by developing a Facebook application[2] that allowed for the gathering of all three sorts of data (user consent, language data, questionnaire data) at once. In addition, the application was easy to share via Facebook, which helped to promote the project and to reach many potential participants. While data collection was solely managed by the Facebook application, we relied on Facebook's in-platform means (i.e. users' sharing and liking) to recruit participants. In order to reach older users (> 50 years), it was additionally necessary to resort to Facebook advertisement.[3]

With the consent of each participant, the data was downloaded via the Facebook Graph API[4] and from the questionnaire service used,[5] and stored in a local MongoDB[6] database. Both entities were linked via randomised unique identifiers. A Python interface provided access points for retrieving user and text data from the database in a linked and structured format, and also allowed us to rebuild the conversational structure of threads by linking successive text objects together. This information can now be used to analyse turn-taking and language choices within threads.[7]

[2] The source code is available at https://bitbucket.org/commul/didi_app.
[3] For details regarding the technical and strategic design of the data collection and the methods of user recruitment see Frey et al. (2014).
[4] https://developers.facebook.com/docs/graph-api
[5] http://www.objectplanet.com/opinio/
[6] https://www.mongodb.com/
[7] The source code is available at https://bitbucket.org/commul/didi_proxy.
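As an illustration of how successive text objects can be linked into threads from such a store, the following sketch retrieves texts from a MongoDB collection and groups them by thread. The collection and field names (didi, texts, thread_id, created_time, language, user_id) are assumptions for illustration and are not necessarily those used by the actual DiDi interface.

```python
# Sketch: rebuild conversation threads by linking successive text objects.
# Collection and field names are illustrative assumptions, not the actual
# schema of the DiDi database; created_time is assumed to be a datetime.
from collections import defaultdict
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
texts = client["didi"]["texts"]

# Group texts into threads, ordered by publishing time.
threads = defaultdict(list)
for doc in texts.find().sort([("thread_id", ASCENDING), ("created_time", ASCENDING)]):
    threads[doc["thread_id"]].append(doc)

# Per-thread information: used languages, number of interlocutors,
# and the time elapsed between successive texts.
for thread_id, msgs in threads.items():
    languages = {m["language"] for m in msgs}
    interlocutors = {m["user_id"] for m in msgs}
    gaps = [m2["created_time"] - m1["created_time"] for m1, m2 in zip(msgs, msgs[1:])]
    print(thread_id, languages, len(interlocutors), gaps)
```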
3 Corpus Annotations

This section describes the annotations added during the process of corpus construction.[8]

[8] See Frey et al. (2015) for detailed information on the anonymisation procedure and the normalisation and processing of German texts, including the identification of languages and varieties.

3.1 Socio-demographic Information about Participants

The corpus provides the following socio-demographic information about the participants, obtained from the online questionnaire: gender, education, employment, internet communication habits, communication devices in use, internet experience, first language(s) (L1), and usage of a South Tyrolean German or Italian dialect and its particular origin.

3.2 Linguistic Annotation of Texts

The corpus was annotated on text and token level with the following information:

• Language identification: The languages used in a text were identified in a semi-automatic approach: firstly, using the language identification tool langid.py (Lui and Baldwin, 2012), and secondly, manually correcting short texts and texts with a low confidence score (see the sketch after this list).

• Tokenisation: The corpus was tokenised with the Twitter tokenizer ark-twokenize-py[9] and subsequently corrected manually for non-standard language tokenisation issues.

• Part-of-speech tagging and lemmatisation: (Corrected) tokens were annotated with PoS tags and lemma information according to the predominant language of the text at hand. We tagged Italian texts with the Italian tag set of the Universal Dependencies project[10] using the RDR PoS Tagger (Nguyen et al., 2014). Subsequently, we manually corrected the PoS annotations to compensate for the low tagging accuracy on social media texts. Additionally, we used the TreeTagger (Schmid, 1994; Schmid, 1995) to assign PoS tags to German, English, Spanish, French and Portuguese texts, applying the standard tagsets for each language. No manual correction was performed for these languages.

• Normalisation: So far, we have manually normalised non-standard language to word-by-word standard transcriptions only for German texts.

• Variety of German: We classified German texts as dialect, non-dialect or unclassifiable by applying a heuristic approach based on the normalisation.

• Untranslatable dialect lexemes: We created a lexicon of untranslatable dialect words encountered during manual normalisation. The dialect lexicon was used to post-process out-of-vocabulary (OOV) tokens in the corpus.

• Foreign language insertions: The most common OOV tokens that we manually classified as foreign-language vocabulary were annotated with information about their language of origin.

• CMC phenomena: Emoticons, emojis, @mentions, hashtags, hyperlinks, and iterations of graphemes and punctuation marks were annotated automatically using regular expressions.

• Topic of the text: In order to investigate context factors of language choice, we annotated texts as either political or non-political according to a list of politicians, political parties and political terms.

[9] https://github.com/myleott/ark-twokenize-py
[10] http://universaldependencies.org/it/pos/index.html
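The following sketch illustrates the kind of semi-automatic language identification described in the first list item: langid.py proposes a language and a confidence score, and texts that are very short or fall below a confidence threshold are flagged for manual correction. The candidate language set, length limit and threshold are illustrative assumptions, not the settings actually used in the project.

```python
# Sketch: semi-automatic language identification with langid.py.
# Language set, length limit and confidence threshold are illustrative
# assumptions, not the project's actual settings.
from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
identifier.set_languages(["de", "it", "en", "es", "fr", "pt"])

MIN_TOKENS = 4        # very short texts go to manual correction
MIN_CONFIDENCE = 0.9  # low-confidence texts go to manual correction


def identify(text):
    """Return (language, needs_manual_check) for a CMC text."""
    lang, confidence = identifier.classify(text)
    needs_check = len(text.split()) < MIN_TOKENS or confidence < MIN_CONFIDENCE
    return lang, needs_check


print(identify("Hallo zusammen, wie geht es euch heute?"))  # e.g. ('de', False)
print(identify("ok :-)"))                                   # short text -> manual check
```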
3.3 Conversation-related Annotations

We rebuilt conversation threads by linking successive texts and created thread objects containing ordered lists of texts that are accessible via the Python interface. Thread objects contain information about the used languages and the number of active interlocutors and recipients of a message, as well as the time elapsed between two texts.

As described in Frey et al. (2015), no text content of non-participants of the DiDi project was stored, but general information about the publishing time and the language of the text was kept. If all interlocutors of a thread were participants of the project, the whole conversation is available.

3.4 User-related Annotations

In addition to the socio-demographic data, we added information about the users' (multilingual) communicative behaviour, i.e. their primary language, the languages they used, and the number of their interlocutors.

4 Corpus Data

4.1 Corpus Size

The DiDi corpus comprises public and non-public language data of 136 South Tyrolean Facebook users. The users could choose to provide their Facebook wall communication (status updates and comments), their chat communication (i.e. private messages), or both. In the end, 50 people provided access to both types of data, 80 users provided access only to their Facebook wall, and 6 users gave us only their chat communication. In total, the corpus consists of around 600 thousand tokens that are distributed over the text categories status updates (172,966 tokens), comments (94,512 tokens) and chat messages (328,796 tokens).

4.2 Multilingualism in the Corpus

The corpus is highly multilingual. Although the initial intention of the project was to document the use of German in South Tyrol, German language content comprises only 58% of the corpus. 13% of the texts are written in Italian and 4% in English (the remainder of the messages was classified as unidentifiable language, non-language or other language). The distribution of the languages reflects the language backgrounds of the participants and is comparable to the multilingual community of South Tyrol. The following tables show the distribution of profiles, texts and tokens (Table 1) and of text types (Table 2) by L1.

User L1       Profiles    Texts    Tokens
IT                   9    4,260    80,368
DE                 108   29,883   421,262
other                3      407     8,643
IT + DE             11    4,165    75,359
DE + other           5    1,110    10,642
Total              136   39,825   596,274

Table 1: Distribution of profiles, texts and tokens by L1.

User L1          SU      CO      PM
IT            1,682   1,063   1,515
DE            7,286   4,890  17,707
other           172      45     190
IT + DE       1,962     343   2,791
DE + other    1,031     166      13
Total        11,102   6,507  22,216

Table 2: Distribution of texts by text type (SU = status updates, CO = comments, PM = private messages) and by L1.

While very few users wrote only in their first language, most users used at least two (88%), very often even three (73%) or more (51%) languages. Table 3 shows the number and proportion of German, Italian and English texts written as first or second/foreign language.

Text written     as L1           as L2
IT                4,761 (57%)     3,566 (42%)
DE               23,191 (99%)       170 (1%)
EN                  166 (4%)      3,625 (96%)
All languages    28,120 (78%)     7,842 (22%)

Table 3: Distribution of text language by L1 or L2 use.

In terms of multilingual language use in the DiDi corpus, we observe a slight difference between Italian- and German-speaking users. L1 Italian speakers stick more closely to their L1 than the German-speaking participants, who are characterized by a higher usage of L2 Italian. The comparison of L1 and L2 usage in status updates, comments and private messages (cf. Table 4) shows that the respective L1 is preferred in all message types. We find the highest percentage of second or foreign language use in status updates, whereas in comments and private messages around 75% of the texts are written in the L1.

Text written       as L1            as L2
Status updates      6,774 (61%)      3,032 (27%)
Comments            5,089 (78%)        924 (14%)
Messages           16,257 (73%)      3,886 (17%)
Total              28,120 (71%)      7,842 (20%)

Table 4: Distribution of L1 and L2 use by text type.

Finally, we observed 4,295 code-switching instances on conversation level and at least 1,653 texts that contain multiple languages.[11] The average code-switching rate per user is 10%, meaning that on average every tenth text does not continue the language of the previous text in the thread (the maximum was around every second text, i.e. 42%). The average proportion of texts with multiple languages per user is 4% (max. 25%). A sketch of how such a per-user switching rate can be computed from the thread structure follows below.

[11] Texts were annotated as mixed-language texts during the correction of the language identification; therefore this annotation has not been done for the whole corpus. A further word-level identification of languages could detect even more mixed-language content (Nguyen and Dogruoz, 2013).
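The following sketch shows one possible way to compute the per-user code-switching rate reported above, i.e. the proportion of a user's texts that do not continue the language of the previous text in the thread. It builds on the thread grouping sketched in Section 2; the field names are again illustrative assumptions rather than the actual DiDi schema.

```python
# Sketch: per-user code-switching rate, i.e. the share of a user's texts whose
# language differs from the immediately preceding text in the same thread.
# Field names (thread_id, user_id, language, created_time) are assumptions.
from collections import defaultdict


def code_switching_rates(texts):
    """texts: iterable of dicts with thread_id, user_id, language, created_time."""
    threads = defaultdict(list)
    for t in texts:
        threads[t["thread_id"]].append(t)

    switches = defaultdict(int)  # texts that switch language, per user
    totals = defaultdict(int)    # texts that have a preceding text, per user
    for msgs in threads.values():
        msgs.sort(key=lambda t: t["created_time"])
        for prev, cur in zip(msgs, msgs[1:]):
            totals[cur["user_id"]] += 1
            if cur["language"] != prev["language"]:
                switches[cur["user_id"]] += 1

    return {user: switches[user] / totals[user] for user in totals}


# Example: the second user switches from German to Italian -> rate 1.0.
sample = [
    {"thread_id": 1, "user_id": "a", "language": "de", "created_time": 1},
    {"thread_id": 1, "user_id": "b", "language": "it", "created_time": 2},
]
print(code_switching_rates(sample))  # {'b': 1.0}
```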
5 Issues in Corpus Creation

In addition to the general issues of working with social media texts (e.g. text processing on noisy, short texts, as described for example in Baldwin et al. (2013) and Eisenstein (2013)), the high diversity of languages and varieties used in our corpus led to various constraints in corpus creation and processing, as cross-lingual annotation and information extraction are still crucial problems in natural language processing. We tried to address the demands of a multilingual corpus by providing language-specific PoS tagging and by applying language-independent annotations. We are aware that this is by no means sufficient to deal with linguistic research questions that exceed language boundaries. Moreover, manual correction tasks occupied a significant part of the work on the corpus, as automatic annotation (e.g. for language identification) does not yet provide the accuracy expected for linguistic studies (Carter et al., 2013; Lui and Baldwin, 2014).

6 Conclusion and Future Work

In this paper we presented a freely available language corpus of Facebook user profiles from South Tyrol, Italy. The multilingual corpus is anonymised and annotated with socio-demographic data about the users, language-specific (and, for Italian, manually corrected) PoS tags, lemmas and linguistic annotations mainly related to the used languages, varieties and multilingual phenomena. The corpus is accessible for querying via ANNIS[12] or can be obtained as processable data for research purposes at http://www.eurac.edu/didi.

[12] http://annis-tools.org/

Acknowledgements

The project was financed by the Provincia autonoma di Bolzano – Alto Adige, Ripartizione Diritto allo studio, università e ricerca scientifica, Legge provinciale 13 dicembre 2006, n. 14 "Ricerca e innovazione".
References

Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How noisy social media text, how diffrnt social media sources. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 356–364.

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107.

Michael Beißwenger. 2013. Das Dortmunder Chat-Korpus. Zeitschrift für germanistische Linguistik, 41(1):161–164.

Manuel Burghardt, Daniel Granvogl, and Christian Wolff. 2016. Creating a Lexicon of Bavarian Dialect by Means of Facebook Language Data and Crowdsourcing. In Proceedings of LREC 2016, pages 2029–2033.

Simon Carter, Wouter Weerkamp, and Manos Tsagkias. 2013. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 47(1):195–215.

Fabio Celli and Luca Polonio. 2013. Relationships between personality and interactions in Facebook. Social Networking: Recent Trends, Emerging Issues and Future Outlook, pages 41–54.

Simone Ciccolone. 2010. Lo standard tedesco in Alto Adige. Il segno e le lettere. LED Edizioni Universitarie, Milan.

Jacob Eisenstein. 2013. What to do about bad language on the internet. In Proceedings of NAACL-HLT, pages 359–369.

Jennifer-Carmen Frey, Egon W. Stemle, and Aivars Glaznieks. 2014. Collecting language data of non-public social media profiles. In Gertrud Faaß and Josef Ruppenhofer, editors, Workshop Proceedings of the 12th Edition of the KONVENS Conference, pages 11–15, Hildesheim, Germany, October. Universitätsverlag Hildesheim.

Jennifer-Carmen Frey, Egon W. Stemle, and Aivars Glaznieks. 2015. The DiDi Corpus of South Tyrolean CMC Data. In Workshop Proceedings of the 2nd Workshop on NLP4CMC at GSCL 2015.

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25–30. Association for Computational Linguistics.

Marco Lui and Timothy Baldwin. 2014. Accurate language identification of Twitter messages. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM) @ EACL, pages 17–25, Gothenburg. Association for Computational Linguistics.

Dong-Phuong Nguyen and A. Seza Dogruoz. 2013. Word level language identification in online multilingual communication. Association for Computational Linguistics.

Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham, and Son Bao Pham. 2014. RDRPOSTagger: A ripple down rules-based part-of-speech tagger. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 17–20.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, volume 12, pages 44–49.

Helmut Schmid. 1995. Improvements in part-of-speech tagging with an application to German. In Proceedings of the ACL SIGDAT-Workshop.