<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The DiDi Corpus of South Tyrolean CMC Data: A multilingual corpus of Facebook texts</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Specialised Communication and Multilingualism EURAC Research Bolzano/Bozen</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>English. The DiDi corpus of South Tyrolean data of computer-mediated communication (CMC) is a multilingual sociolinguistic language corpus. It consists of around 600,000 tokens collected from 136 profiles of Facebook users residing in South Tyrol, Italy. In conformity with the multilingual situation of the territory, the main languages of the corpus are German and Italian (followed by English). The data has been manually anonymised and provides manually corrected part-of-speech tags for the Italian language texts and manually normalised data for German texts. Moreover, it is annotated with user-provided socio-demographic data (among others L1, gender, age, education, and internet communication habits) from a questionnaire, and linguistic annotations regarding CMC phenomena, languages and varieties. The anonymised corpus is freely available for research purposes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Italian. DiDi is a corpus of computer-mediated communication (CMC) that collects linguistic data from the South Tyrolean area. The multilingual, sociolinguistic corpus consists of around 600,000 tokens collected (with prior consent to the use of the data) from the profiles of 136 Facebook users residing in South Tyrol. The main languages of the corpus, German and Italian (followed by English), reflect the multilingual space of the territory. The data has been manually anonymised, and the Italian-language texts come with manually corrected part-of-speech tags. In addition, DiDi is annotated with socio-demographic data provided by the users through a questionnaire (among others L1, gender, age, education, and internet communication habits) and contains further linguistic annotations on CMC-related phenomena and on the use of linguistic varieties. The anonymised corpus is freely available for research purposes.</p>
    </sec>
    <sec id="sec-2">
      <title>The DiDi Project</title>
      <p>
        The autonomous Italian province of South
Tyrol is characterized by a multilingual environment
with three official languages (Italian, German, and
Ladin), an institutional bi- or trilingualism
(depending on the percentage of the Ladin
population), and diverse individual language repertoires
        <xref ref-type="bibr" rid="ref7">(Ciccolone, 2010)</xref>
        .
      </p>
      <p>In the regionally funded DiDi project,<sup>1</sup> the goal was to build a South Tyrolean CMC corpus to document the current language use of residents and to analyse it socio-linguistically with a focus on age. The project initially focused on the German-speaking language group. However, all information regarding the project (e.g. the invitation to participate, the privacy agreement, the project web site, and the questionnaire for socio-demographic data) was published in both German and Italian, and so we attracted speakers of both languages. Accordingly, the collected data is multilingual: most of it is in German, but a substantial portion is in Italian (100,000 of 600,000 tokens).</p>
      <p>The collected multilingual CMC corpus combines Facebook status updates, comments, and private messages with socio-demographic data on the writers (e.g. language biography, internet usage habits, and general parameters like age, gender, and level of education). The data was enriched with linguistic annotations on thread, text and token level, including language-specific part-of-speech (PoS) and lemma information, normalisation, and language identification.</p>
      <p><sup>1</sup> For further information see www.eurac.edu/didi.</p>
      <p>In this paper, we describe the corpus with respect to its multilingual characteristics and give special emphasis to the Italian part of the corpus, to which we added manually corrected PoS annotations. The paper thus continues Frey et al. (2015), which was restricted to the German texts of the corpus and did not take into account the full variety of data collected for the total corpus.</p>
    </sec>
    <sec id="sec-3">
      <title>Corpus Construction</title>
      <p>For the purpose of the DiDi project, we
collected language data from social networking sites
(SNS) and combined it with socio-demographic
data about the writers obtained from a
questionnaire. We chose to collect data from Facebook
as this SNS is well known in South Tyrol, hosts a
wide variety of different communication settings,
and is used over the whole territory by nearly all
groups of the society.</p>
      <p>Related research mainly draws on public data
such as public Facebook groups, Twitter or
chat data (e.g. Celli and Polonio (2013), Basile
and Nissim (2013), Burghardt et al. (2016),
Beißwenger (2013)), excluding the possibility to
analyse discourse patterns of non-public
everyday language use.</p>
      <p>Collecting non-public and personal data for the DiDi corpus raised technical issues regarding Italian privacy regulations (which require user consent, including a privacy statement), the time-saving acquisition of authentic and complete language data, and the assignment of language data to questionnaire data. We solved these issues by developing a Facebook application<sup>2</sup> that gathered all three sorts of data (user consent, language data, questionnaire data) at once. In addition, the application was easy to share via Facebook, which helped to promote the project and to reach many potential participants. While data collection was managed solely by the Facebook application, we relied on Facebook’s in-platform means (i.e. users’ sharing and liking) to recruit participants. In order to reach older users (over 50 years), it was necessary to additionally resort to Facebook advertisement.<sup>3</sup></p>
      <p><sup>2</sup> The source code is available at https://bitbucket.org/commul/didi_app.</p>
      <p><sup>3</sup> For details regarding the technical and strategic design of the data collection and the methods of user recruitment see Frey et al. (2014).</p>
      <p>With the consent of each participant, the data was downloaded via the Facebook Graph API<sup>4</sup> and from the questionnaire service we used,<sup>5</sup> and stored in a local MongoDB<sup>6</sup> database. The two data sources were linked via randomised unique identifiers. A Python interface provided access points to retrieve user and text data from the database in a linked and structured format, and also made it possible to rebuild the conversational structure of threads by linking successive text objects together. This information can now be used to analyse turn-taking and language choices within threads.<sup>7</sup></p>
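      <p>The linking of questionnaire and language data and the reconstruction of threads described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual interface: the field names (user_id, thread_id, created) and all function names are assumptions made for this example, and plain Python dictionaries stand in for the database records.</p>
      <preformat>
```python
import uuid
from collections import defaultdict


def new_pseudonym():
    """Return a random identifier that carries no information about the user."""
    return uuid.uuid4().hex


def link_records(questionnaires, texts):
    """Attach to each text the questionnaire record sharing its user id.

    Texts whose user id has no questionnaire record get user=None.
    All field names here are hypothetical.
    """
    by_user = {q["user_id"]: q for q in questionnaires}
    return [dict(t, user=by_user.get(t["user_id"])) for t in texts]


def rebuild_threads(texts):
    """Group texts by thread id and order each group by creation time."""
    threads = defaultdict(list)
    for text in texts:
        threads[text["thread_id"]].append(text)
    return {
        thread_id: sorted(items, key=lambda t: t["created"])
        for thread_id, items in threads.items()
    }
```
      </preformat>
      <p>Because both record types carry only the random identifier, texts and questionnaire answers can be joined without storing any account name alongside them.</p>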
    </sec>
    <sec id="sec-4">
      <title>Corpus Annotations</title>
      <p>This section describes the annotations added during the process of corpus construction.<sup>8</sup></p>
      <sec id="sec-4-1">
        <title>Socio-demographic Information about Participants</title>
        <p>The corpus provides the following socio-demographic information about the participants, obtained from the online questionnaire: gender, education, employment, internet communication habits, communication devices in use, internet experience, first language(s) (L1), and usage of a South Tyrolean German or Italian dialect and its particular origin.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Linguistic Annotation of Texts</title>
        <p>The corpus was annotated on text and token level with the following types of information.</p>
        <sec id="sec-4-3-1">
          <title>Language identification:</title>
          <p>
            The languages used in a text were identified semi-automatically: we first applied the language identification tool langid.py
            <xref ref-type="bibr" rid="ref11">(Lui and Baldwin, 2012)</xref>
            and then manually corrected the labels of short texts and of texts with a low confidence score.
          </p>
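          <p>The semi-automatic procedure can be pictured as a simple triage step. The sketch below is illustrative only: the thresholds (min_tokens, min_confidence) are assumed values, not those used in the project, and toy_classifier is a stand-in for a real identifier such as langid.py. Any callable that returns a (language, confidence) pair with a confidence between 0 and 1 could be passed instead.</p>
          <preformat>
```python
def triage_language_ids(texts, classify, min_tokens=4, min_confidence=0.9):
    """Split texts into automatically accepted labels and a manual review queue.

    classify is any callable returning (language, confidence in [0, 1]).
    Both thresholds are illustrative assumptions.
    """
    accepted, review = [], []
    for text in texts:
        lang, confidence = classify(text)
        record = {"text": text, "lang": lang, "confidence": confidence}
        too_short = min_tokens > len(text.split())
        unsure = min_confidence > confidence
        if too_short or unsure:
            review.append(record)
        else:
            accepted.append(record)
    return accepted, review


def toy_classifier(text):
    """Stand-in classifier guessing from a few function words (illustration only)."""
    words = text.lower().split()
    german = sum(w in {"ich", "und", "nicht", "das"} for w in words)
    italian = sum(w in {"io", "e", "non", "che"} for w in words)
    total = german + italian
    if total == 0:
        return "unknown", 0.0
    lang = "de" if german >= italian else "it"
    return lang, max(german, italian) / total
```
          </preformat>
          <p>Only the records in the review queue would then be inspected by an annotator, which keeps the manual effort focused on the texts where automatic identification is least reliable.</p>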
        </sec>
        <sec id="sec-4-3-2">
          <title>Tokenisation:</title>
          <p>The corpus was tokenized with the Twitter tokenizer ark-twokenize-py<sup>9</sup> and subsequently corrected manually for tokenisation issues caused by non-standard language.</p>
          <p><sup>4</sup> https://developers.facebook.com/docs/graph-api</p>
          <p><sup>5</sup> http://www.objectplanet.com/opinio/</p>
          <p><sup>6</sup> https://www.mongodb.com/</p>
          <p><sup>7</sup> The source code is available at https://bitbucket.org/commul/didi_proxy.</p>
          <p><sup>8</sup> See Frey et al. (2015) for detailed information on the anonymisation procedure and the normalisation and processing of German texts, including identification of languages and varieties.</p>
          <p><sup>9</sup> https://github.com/myleott/ark-twokenize-py</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>Part-of-speech tagging and lemmatization:</title>
          <p>
            The (corrected) tokens were annotated with PoS tags and lemma information according to the predominant language of the text at hand. We tagged Italian texts with the Italian tag set of the Universal Dependencies project<sup>10</sup> using the RDR PoS Tagger
            <xref ref-type="bibr" rid="ref14">(Nguyen et al., 2014)</xref>
            . Subsequently, we manually corrected the PoS annotations to compensate for the low tagging accuracy on social media texts. Additionally, we used the TreeTagger
            <xref ref-type="bibr" rid="ref15 ref16">(Schmid, 1994; Schmid, 1995)</xref>
            to assign PoS tags to German, English, Spanish, French and Portuguese texts, applying the standard tagset for each language. No manual correction was performed for these languages.
          </p>
        </sec>
        <sec id="sec-4-3-4">
          <title>Normalisation:</title>
          <p>So far, we have manually normalised non-standard language to word-by-word standard transcriptions only for German texts.</p>
        </sec>
        <sec id="sec-4-3-5">
          <title>Variety of German:</title>
          <p>We classified German texts as dialect, non-dialect or unclassifiable texts, applying a heuristic approach based on the normalisation.</p>
        </sec>
        <sec id="sec-4-3-6">
          <title>Untranslatable dialect lexemes:</title>
          <p>We have created a lexicon for untranslatable
dialect words encountered during manual
normalisation. The dialect lexicon was used
to post-process out-of-vocabulary (OOV)
tokens in the corpus.</p>
        </sec>
        <sec id="sec-4-3-7">
          <title>Foreign language insertions:</title>
          <p>The most common OOV tokens that we
manually classified as foreign language
vocabulary have been annotated with information
about their language origin.</p>
        </sec>
        <sec id="sec-4-3-8">
          <title>CMC phenomena:</title>
          <p>Emoticons, emojis, @mentions, hashtags,
hyperlinks, and iterations of graphemes and
punctuation marks were annotated
automatically using regular expressions.</p>
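          <p>A regular-expression annotator of this kind might look like the sketch below. The patterns are simplified assumptions written for illustration and are not the expressions used for the corpus; in particular, the emoticon and emoji patterns cover only a small subset of what occurs in real data.</p>
          <preformat>
```python
import re

# Illustrative patterns only; real CMC data needs broader coverage.
CMC_PATTERNS = {
    "hyperlink": re.compile(r"https?://\S+"),
    "mention": re.compile(r"@\w+"),
    "hashtag": re.compile(r"#\w+"),
    "emoticon": re.compile(r"[:;=]-?[()DPp]"),
    "emoji": re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]"),
    # three or more repetitions of the same word character, e.g. "sooo"
    "grapheme_iteration": re.compile(r"(\w)\1{2,}"),
    # two or more sentence-final punctuation marks in a row, e.g. "!!!"
    "punctuation_iteration": re.compile(r"[!?.]{2,}"),
}


def annotate_cmc(text):
    """Return {phenomenon: [matched strings]} for every pattern that matches."""
    found = {}
    for name, pattern in CMC_PATTERNS.items():
        matches = [m.group(0) for m in pattern.finditer(text)]
        if matches:
            found[name] = matches
    return found
```
          </preformat>
          <p>Keeping one named pattern per phenomenon makes it easy to extend or tighten a single category without touching the others.</p>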
        </sec>
        <sec id="sec-4-3-9">
          <title>Topic of the text:</title>
          <p>In order to investigate contextual factors of language choice, we annotated texts as either political or non-political according to a list of politicians, political parties and political terms.</p>
          <p><sup>10</sup> http://universaldependencies.org/it/pos/index.html</p>
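          <p>List-based topic annotation of this kind reduces to a lexicon lookup. In the sketch below, the term list is a handful of placeholder words chosen for illustration; it is not the project's list, and the matching strategy (case-insensitive whole-token match) is likewise an assumption.</p>
          <preformat>
```python
import re

# Placeholder entries for illustration; not the list used for the corpus.
POLITICAL_TERMS = {"landtag", "wahl", "partei", "elezioni", "partito"}


def topic_label(text, terms=POLITICAL_TERMS):
    """Label a text 'political' if any of its tokens is in the term list."""
    tokens = re.findall(r"\w+", text.lower())
    if any(token in terms for token in tokens):
        return "political"
    return "non-political"
```
          </preformat>
          <p>Multi-word entries such as full names of politicians would need phrase matching rather than the single-token lookup shown here.</p>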
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>Conversation-related Annotations</title>
        <p>We rebuilt conversation threads by linking successive texts and created thread objects containing ordered lists of texts that are accessible via the Python interface. Thread objects contain information about the languages used, the number of active interlocutors and recipients of a message, and the time that passed between two texts.</p>
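        <p>The kind of information held by a thread object can be illustrated with a small summary function. The record fields (author, lang, created as an ISO 8601 timestamp) and the function name are assumptions for this sketch and do not reflect the actual data model of the interface.</p>
        <preformat>
```python
from datetime import datetime


def summarise_thread(texts):
    """Summarise languages, active interlocutors and time gaps of one thread.

    Each text is a dict with the hypothetical keys 'author', 'lang' and
    'created' (ISO 8601 timestamp string).
    """
    ordered = sorted(texts, key=lambda t: datetime.fromisoformat(t["created"]))
    times = [datetime.fromisoformat(t["created"]) for t in ordered]
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return {
        "languages": sorted({t["lang"] for t in ordered}),
        "active_interlocutors": len({t["author"] for t in ordered}),
        "gaps_seconds": gaps,
    }
```
        </preformat>
        <p>The gaps are returned in thread order, so the n-th value is the time between the n-th text and its successor.</p>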
        <p>As described in Frey et al. (2015), no text
content of non-participants of the DiDi project was
stored, but general information about the
publishing time and the language of the text was kept. If
all interlocutors of a thread were participants of
the project, the whole conversation is available.</p>
      </sec>
      <sec id="sec-4-5">
        <title>User-related Annotations</title>
        <p>In addition to socio-demographic data, we added information about the users’ (multilingual) communicative behaviour, i.e. their primary language, the languages they used and the number of interlocutors.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Corpus Data</title>
      <sec id="sec-5-1">
        <title>Corpus Size</title>
        <p>The DiDi corpus comprises public and non-public
language data of 136 South Tyrolean Facebook
users. The users could choose to provide either
their Facebook wall communication (status
updates and comments), their chat (i.e. private
messages) communication or both. In the end, 50
people provided access to both types of data. 80 users
only provided access to their Facebook wall and 6
users gave us only their chat communication. In
total, the corpus consists of around 600 thousand
tokens that are distributed over the text categories
status updates (172 ,66 tokens), comments (94,512
tokens) and chat messages (328,796 tokens).</p>
      </sec>
      <sec id="sec-5-2">
        <title>Multilingualism in the Corpus</title>
        <p>The corpus is highly multilingual. Although the
initial intention of the project was to document
the use of German in South Tyrol, German
language content comprises only 58% of the corpus.
13% is written in Italian and 4% in English (the remaining messages were classified as unidentifiable language, non-language or other language). The distribution of languages reflects the language backgrounds of the participants and is comparable to that of the multilingual community of South Tyrol. Tables 1 and 2 show the distribution of profiles, texts and tokens (Table 1) and of text types (Table 2) by L1.</p>
        <p>While very few users wrote only in their first
language, most users used at least two (88%), very
often even three (73%) or more (51%) languages.
Table 3 shows the number and proportion of
German, Italian and English texts written as first or
second/foreign language.</p>
        <p>In terms of multilingual language use in the
DiDi corpus, we observe a slight difference
between Italian and German-speaking users. L1
Italian speakers stick more to their L1 compared to the
German-speaking participants, who are
characterized by a higher usage of L2 Italian. The
comparison of L1 and L2 usage in status updates,
comments and private messages (cf. Table 4) shows
that the respective L1 is preferred in all message
types. We find the highest percentage of second
or foreign language use in status updates, whereas
in comments and private messages around 75% of
the texts are written in L1.</p>
      </sec>
      <sec id="sec-5-3">
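        <title>Text written</title>
        <table-wrap>
          <table>
            <thead>
              <tr><th>Text type</th><th>Texts written as L1</th></tr>
            </thead>
            <tbody>
              <tr><td>Status updates</td><td>6,774 (61%)</td></tr>
              <tr><td>Comments</td><td>5,089 (78%)</td></tr>
              <tr><td>Messages</td><td>16,257 (73%)</td></tr>
              <tr><td>Total</td><td>28,120 (71%)</td></tr>
            </tbody>
          </table>
        </table-wrap>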
        <p>Finally, we observed 4,295 code-switching instances at the conversation level and at least 1,653 texts that contain multiple languages.<sup>11</sup> The average code-switching rate per user is 10%, meaning that every tenth text does not continue the language of the previous text in the thread (the maximum was 42%, i.e. close to every second text). The average proportion of texts with multiple languages per user is 4% (max. 25%).</p>
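        <p>Code-switching at the conversation level, as defined above, can be counted by comparing each text with its predecessor in the thread. The following sketch assumes that every thread is given as an ordered list of language labels. How texts of non-participants or unidentifiable texts enter the count, and which texts form the denominator of the rate, are not specified here and are treated as assumptions.</p>
        <preformat>
```python
def count_switches(languages):
    """Count texts whose language differs from the preceding text in a thread."""
    return sum(1 for prev, cur in zip(languages, languages[1:]) if prev != cur)


def switch_rate(threads):
    """Share of texts that do not continue the language of the previous text.

    threads is a list of threads, each an ordered list of language labels.
    Using all texts as the denominator is an assumption of this sketch.
    """
    switches = sum(count_switches(langs) for langs in threads)
    texts = sum(len(langs) for langs in threads)
    return switches / texts if texts else 0.0
```
        </preformat>
        <p>For example, a thread with the labels de, de, it, de contains two switches: one into Italian and one back into German.</p>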
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Issues in Corpus Creation</title>
      <p>
        In addition to the general issues of working with social media texts (e.g. processing noisy, short texts; see
        <xref ref-type="bibr" rid="ref1 ref8">Baldwin et al., 2013; Eisenstein, 2013</xref>
        ), the high diversity of languages and varieties used in our corpus imposed various constraints on corpus creation and processing, as cross-lingual annotation and information extraction remain open problems in natural language processing. We tried to address the demands of a multilingual corpus by providing language-specific PoS tagging and by applying language-independent annotations. We are aware that this is by no means sufficient to deal with linguistic research questions that cross language boundaries. Moreover, manual correction tasks occupied a significant part of the work on the corpus, as automatic annotation (e.g. for language identification) does not yet provide the accuracy expected for linguistic studies
        <xref ref-type="bibr" rid="ref12 ref5">(Carter et al., 2013; Lui and Baldwin, 2014)</xref>
        .
      </p>
      <p><sup>11</sup> Texts were annotated as mixed-language texts during the correction of the language identification; this annotation has therefore not been done for the whole corpus. A further word-level identification of languages could detect even more mixed-language content <xref ref-type="bibr" rid="ref13">(Nguyen and Dogruoz, 2013)</xref>.</p>
      <p>In this paper we presented a freely available language corpus of Facebook user profiles from South Tyrol, Italy. The multilingual corpus is anonymised and annotated with socio-demographic data of users, language-specific (and for Italian manually corrected) PoS tags, lemmas and linguistic annotations mainly related to the languages used, varieties and multilingual phenomena. The corpus is accessible for querying via ANNIS<sup>12</sup> or can be obtained as processable data for research purposes at http://www.eurac.edu/didi.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>The project was financed by the Provincia autonoma di Bolzano – Alto Adige, Ripartizione Diritto allo studio, università e ricerca scientifica, Legge provinciale 13 dicembre 2006, n. 14 “Ricerca e innovazione”.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Timothy</given-names>
            <surname>Baldwin</surname>
          </string-name>
          , Paul Cook, Marco Lui, Andrew MacKinlay, and
          <string-name>
            <given-names>Li</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>How noisy social media text, how diffrnt social media sources</article-title>
          .
          <source>In Proceedings of the Sixth International Joint Conference on Natural Language Processing</source>
          , pages
          <fpage>356</fpage>
          -
          <lpage>364</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          and
          <string-name>
            <given-names>Malvina</given-names>
            <surname>Nissim</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Sentiment analysis on Italian tweets</article-title>
          .
          <source>In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis</source>
          , pages
          <fpage>100</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Michael</given-names>
            <surname>Beißwenger</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Das Dortmunder ChatKorpus</article-title>
          .
          <source>Zeitschrift für germanistische Linguistik</source>
          ,
          <volume>41</volume>
          (
          <issue>1</issue>
          ):
          <fpage>161</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Manuel</given-names>
            <surname>Burghardt</surname>
          </string-name>
          , Daniel Granvogl, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Wolff</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Creating a Lexicon of Bavarian Dialect by Means of Facebook Language Data and Crowdsourcing</article-title>
          .
          <source>In Proceedings of LREC 2016</source>
          , pages
          <fpage>2029</fpage>
          -
          <lpage>2033</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Simon</given-names>
            <surname>Carter</surname>
          </string-name>
          , Wouter Weerkamp, and
          <string-name>
            <given-names>Manos</given-names>
            <surname>Tsagkias</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text</article-title>
          .
          <source>Language Resources and Evaluation</source>
          ,
          <volume>47</volume>
          (
          <issue>1</issue>
          ):
          <fpage>195</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Celli</surname>
          </string-name>
          and
          <string-name>
            <given-names>Luca</given-names>
            <surname>Polonio</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Relationships between personality and interactions in facebook</article-title>
          .
          <source>Social Networking: Recent Trends, Emerging Issues and Future Outlook</source>
          , pages
          <fpage>41</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Simone</given-names>
            <surname>Ciccolone</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Lo standard tedesco in Alto Adige</article-title>
          .
          <article-title>Il segno e le lettere</article-title>
          .
          <source>LED Edizioni Universitarie</source>
          , Milan.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>What to do about bad language on the internet</article-title>
          .
          <source>In Proceedings of NAACL-HLT</source>
          , pages
          <fpage>359</fpage>
          -
          <lpage>369</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Jennifer-Carmen</given-names>
            <surname>Frey</surname>
          </string-name>
          , Egon W. Stemle, and
          <string-name>
            <given-names>Aivars</given-names>
            <surname>Glaznieks</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Collecting language data of non-public social media profiles</article-title>
          .
          <source>In Gertrud Faaß and Josef Ruppenhofer</source>
          , editors,
          <source>Workshop Proceedings of the 12th Edition of the KONVENS Conference</source>
          , pages
          <fpage>11</fpage>
          -
          <lpage>15</lpage>
          , Hildesheim, Germany, October. Universitätsverlag Hildesheim, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Jennifer-Carmen</given-names>
            <surname>Frey</surname>
          </string-name>
          , Egon W. Stemle, and
          <string-name>
            <given-names>Aivars</given-names>
            <surname>Glaznieks</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>The DiDi Corpus of South Tyrolean CMC Data</article-title>
          .
          <source>In Workshop Proceedings of the 2nd Workshop on NLP4CMC at GSCL2015.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Marco</given-names>
            <surname>Lui</surname>
          </string-name>
          and
          <string-name>
            <given-names>Timothy</given-names>
            <surname>Baldwin</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>langid.py: An off-the-shelf language identification tool</article-title>
          .
          <source>In Proceedings of the ACL 2012 system demonstrations</source>
          , pages
          <fpage>25</fpage>
          -
          <lpage>30</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Marco</given-names>
            <surname>Lui</surname>
          </string-name>
          and
          <string-name>
            <given-names>Timothy</given-names>
            <surname>Baldwin</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Accurate language identification of twitter messages</article-title>
          .
          <source>In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM)@ EACL</source>
          , pages
          <fpage>17</fpage>
          -
          <lpage>25</lpage>
          , Gothenburg. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Dong-Phuong</given-names>
            <surname>Nguyen</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. Seza</given-names>
            <surname>Dogruoz</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Word level language identification in online multilingual communication</article-title>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Dat Quoc</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , Dai Quoc Nguyen, Dang Duc Pham, and
          <string-name>
            <given-names>Son Bao</given-names>
            <surname>Pham</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>RDRPOSTagger: A ripple down rules-based part-of-speech tagger</article-title>
          .
          <source>In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics</source>
          , pages
          <fpage>17</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Helmut</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Probabilistic part-of-speech tagging using decision trees</article-title>
          .
          <source>In Proceedings of the international conference on new methods in language processing</source>
          , volume
          <volume>12</volume>
          , pages
          <fpage>44</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Helmut</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Improvements in part-of-speech tagging with an application to German</article-title>
          .
          <source>In Proceedings of the ACL SIGDAT-Workshop.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>