=Paper=
{{Paper
|id=Vol-1749/paper_016
|storemode=property
|title=Building a Social Media Adapted PoS Tagger Using FlexTag – A Case Study on Italian Tweets
|pdfUrl=https://ceur-ws.org/Vol-1749/paper_016.pdf
|volume=Vol-1749
|authors=Tobias Horsmann,Torsten Zesch
|dblpUrl=https://dblp.org/rec/conf/clic-it/HorsmannZ16
}}
==Building a Social Media Adapted PoS Tagger Using FlexTag – A Case Study on Italian Tweets==
Tobias Horsmann Torsten Zesch
Language Technology Lab
Department of Computer Science and Applied Cognitive Science
University of Duisburg-Essen, Germany
{tobias.horsmann,torsten.zesch}@uni-due.de
Abstract

We present a detailed description of our submission to the PoSTWITA shared task for PoS tagging of Italian social media text. We train a model based on FlexTag using only the provided training data and external resources like word clusters and a PoS dictionary which are built from publicly available Italian corpora. We find that this minimal adaptation strategy, which already worked well for German social media data, is also highly effective for Italian.

1 Introduction

In this paper, we describe our submission to the PoSTWITA Shared Task 2016 that aims at building accurate PoS tagging models for Italian Twitter messages. We rely on FlexTag (Zesch and Horsmann, 2016), a flexible, general-purpose PoS tagging architecture that can be easily adapted to new domains and languages. We re-use the configuration from Horsmann and Zesch (2015) that has been shown to be most effective for adapting a tagger to the social media domain. Besides training on the provided annotated data, it mainly relies on external resources like PoS dictionaries and word clusters that can be easily created from publicly available Italian corpora. The same configuration has been successfully applied for adapting FlexTag to German social media text (Horsmann and Zesch, 2016).

2 Experimental Setup

We use the FlexTag CRF classifier (Lafferty et al., 2001) with a context window of ±1 tokens, the 750 most frequent character n-grams over all bi-, tri-, and four-grams, and boolean features indicating whether a token contains a hyphen, period, comma, bracket, underscore, or number. We furthermore use boolean features capturing whether a token is fully capitalized, a retweet, a URL, a user mention, or a hashtag.

Data: We train our tagging model only on the annotated data provided by the shared task organizers. As this training set is relatively large, we decided against adding annotated data from foreign domains, which is a common strategy to offset small in-domain training sets (Ritter et al., 2011; Horsmann and Zesch, 2016).

Resources: Word clusters: We create word clusters using Brown clustering (Brown et al., 1992) from 400 million tokens of Italian Twitter messages crawled between 2011 and 2016.

PoS dictionary: We create a PoS dictionary which stores the three most frequent PoS tags of a word. We build the dictionary using a PoS-annotated Italian Wikipedia corpus (http://wacky.sslmit.unibo.it/doku.php?id=corpora).

Namelists: We furthermore use lists of first names obtained from Wikipedia and extract words tagged as named entities from the ItWaC web corpus (Baroni et al., 2009) to improve coverage of named entities.
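The token-level feature set described in Section 2 can be sketched as follows. This is illustrative only: FlexTag's actual feature extractors and feature names differ, and the restriction to the 750 most frequent character n-grams over the training data is omitted here.

```python
def token_features(tokens, i):
    """Sketch of the orthographic features from Section 2 for token i.

    Illustrative only: names are hypothetical and the frequency-based
    n-gram filtering is left out."""
    tok = tokens[i]
    feats = {
        # context window of +/-1 tokens
        "token": tok.lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        # boolean surface-form features
        "has_hyphen": "-" in tok,
        "has_period": "." in tok,
        "has_comma": "," in tok,
        "has_bracket": any(c in tok for c in "()[]{}"),
        "has_underscore": "_" in tok,
        "has_number": any(c.isdigit() for c in tok),
        "all_caps": tok.isupper() and tok.isalpha(),
        # social-media-specific features
        "is_retweet": tok == "RT",
        "is_url": tok.startswith(("http://", "https://", "www.")),
        "is_mention": tok.startswith("@"),
        "is_hashtag": tok.startswith("#"),
    }
    # character bi-, tri-, and four-grams of the token
    for n in (2, 3, 4):
        for j in range(len(tok) - n + 1):
            feats["char=" + tok[j:j + n]] = True
    return feats

feats = token_features(["RT", "@utente", "che", "giornata", "!"], 1)
```

Per-token feature dictionaries of this shape are exactly what common CRF toolkits consume, with the cluster-ID and dictionary-tag lookups from the external resources added as further entries.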
                     Acc All  Acc OOV
TreeTagger Baseline     75.5        -
PoSTWITA                90.6     80.5
 + Clusters             92.7     85.6
 + PoS-Dict             92.2     85.3
 + Namelist             91.1     81.4
 + All Resources        92.9     86.2

Table 1: Results on the test data set

Baseline System: We compare our results to the Italian model of TreeTagger (Schmid, 1995). As TreeTagger uses a much more fine-grained tagset than the one used in this shared task, we map its fine-grained tags to the shared-task tagset using the mapping provided by DKPro Core (Eckart de Castilho and Gurevych, 2014).

3 Results

Table 1 gives an overview of our results. Besides the baseline, we show the results for only using the available training data (labeled PoSTWITA) and when adding the different types of external resources.

Tag          #    Acc  Primary Confusion
ADP A      145  100.0  -
HASHTAG    115  100.0  -
MENTION    186  100.0  -
PUNCT      583  100.0  -
CONJ       123   99.2  VERB
URL        119   98.3  VERB
DET        306   95.8  PRON
ADP        351   95.7  ADV
PRON       327   93.3  DET
NUM         70   92.9  ADJ
INTJ        66   92.4  NOUN
NOUN       607   91.6  PROPN
VERB       568   91.6  AUX
AUX        109   90.8  VERB
ADV        321   90.3  SCONJ
SCONJ       60   90.0  PRON
ADJ        210   86.2  NOUN
EMO         79   83.5  SYM
PROPN      346   79.5  NOUN
VERB CLIT   27   77.8  NOUN
SYM         12   72.7  PUNCT
X           27   55.6  EMO

Table 2: Accuracy per word class on the test data
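The binomial normal-approximation confidence interval used below for the best result in Table 1 can be reproduced with a few lines; a minimal sketch, assuming a test set of 4,757 tokens (the sum of the per-class token counts in Table 2):

```python
import math

def binomial_ci(acc, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for an accuracy
    measured on n tokens; z = 1.96 corresponds to alpha = 0.05."""
    se = math.sqrt(acc * (1.0 - acc) / n)
    return acc - z * se, acc + z * se

# 92.9% accuracy; test-set size of 4,757 is an assumption (sum of Table 2 counts)
low, high = binomial_ci(0.929, 4757)
print(f"{low:.3f} .. {high:.3f}")  # prints "0.922 .. 0.936"
```

Any configuration whose accuracy falls outside these bounds differs significantly from the best run under this approximation.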
The baseline is not competitive with any of our system configurations, which confirms the generally poor performance of off-the-shelf PoS taggers on social media text. Using all resources yields our best result of 92.9%. Among the individual resources, word clusters perform best regarding both overall accuracy and accuracy on out-of-vocabulary (OOV) tokens. This shows that clusters are also highly effective for Italian, as was previously shown for English (Owoputi et al., 2013) and German (Horsmann and Zesch, 2016).

We also computed the confidence interval by binomial normal approximation (α = 0.05). We obtain an upper bound of 93.6 and a lower bound of 92.2. This shows that our best configuration is significantly better than using only the provided training data. Looking at the official PoSTWITA results, it also shows that there are no significant differences between the top-ranking systems.

Error Analysis: In Table 2, we show the accuracy for each PoS tag on the test data set. The largest confusion class is between nouns and proper nouns, which is in line with previous findings for other languages (Horsmann and Zesch, 2016). It can be argued whether requiring the PoS tagger to make this kind of distinction is actually a good idea, as it often does not depend on syntactic properties, but on the wider usage context. Because of the high number of noun/proper-noun confusions, it is also likely that improvements for this class will hide improvements on smaller classes that might be more important quality indicators for social media tagging. In our error analysis, we will thus focus on more interesting cases.

In Table 3, we show examples of selected tagging errors. In the case of the two adjective-determiner confusions, both words occurred in the training data, but never as adjectives. The verb examples show cases where incorrectly tagging a verb as an auxiliary leads to a follow-up error. We have to stress here that the feature set we use for training our PoS tagger does not use any linguistic knowledge about Italian. Thus, adding linguistic knowledge might help to better inform the tagger how to avoid such errors.

Adjective Confusions
Token   Gold/Pred   Token     Gold/Pred
cazzo   INTJ        successo  VERB
sono    VERB        dal       ADP A
tutti   DET         quel      ADJ / DET
sti     ADJ / DET   cazzo     NOUN
tweet   NOUN        di        ADP

Verb Confusions
Token          Gold/Pred    Token      Gold/Pred
maggiormente   ADV          è          AUX / VERB
dell'          ADP A        sempre     ADV
essere         VERB / AUX   stata      VERB / AUX
capito         ADJ / VERB   togliersi  VERB CLIT
.              PUNCT        dai        ADP A

Table 3: Adjective and Verb confusions

[Figure 1: Learning curve on the training data with and without resources; accuracy (80-100%) plotted against the percentage of data used for training (10-90%), with one curve per setting (Using Resources, No Resources).]

Amount of Training Data: The amount of annotated social media text (120k tokens) in this
shared task is an order of magnitude larger than what was used in other shared tasks for tagging social media text. This raises the question of how much annotated training data is actually necessary to train a competitive social media PoS tagging model.

In Figure 1, we plot two learning curves that show how accuracy improves with an increasing amount of training data. We split the training data into ten chunks of equal size and add one additional data chunk in each iteration. We show two curves, one for just using the training data and one for additionally using all our resources. When using no resources, we see a rather steep and continuous increase of the learning curve, which shows how challenging it is to provide sufficient training data for this domain. Using resources, this need for training data is compensated, and only a small amount of training data is required to train a good model. The curves also show that the remaining problems are certainly not going to be solved by providing more training data.

4 Summary

We presented our contribution to the PoSTWITA shared task 2016 for PoS tagging of Italian social media text. We show that the same adaptation strategies that have been applied for English and German also lead to competitive results for Italian. Word clusters are the most effective resource and considerably help to reduce the problem of out-of-vocabulary tokens. In a learning curve experiment, we show that adding more annotated data is not likely to provide further improvements, and we recommend instead adding more language-specific knowledge. We make our experiments and resources publicly available (https://github.com/Horsmann/EvalitaPoSTWITA2016.git).

Acknowledgments

This work was supported by the Deutsche Forschungsgemeinschaft (DFG) under grant No. GRK 2167, Research Training Group “User-Centred Social Media”.

References

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Peter F Brown, Peter V DeSouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-Based n-gram Models of Natural Language. Computational Linguistics, 18:467–479.

Richard Eckart de Castilho and Iryna Gurevych. 2014. A broad-coverage collection of portable NLP components for building shareable analysis pipelines. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT (OIAF4HLT) at COLING 2014, pages 1–11, Dublin, Ireland.

Tobias Horsmann and Torsten Zesch. 2015. Effectiveness of Domain Adaptation Approaches for Social Media PoS Tagging. In Proceedings of the 2nd Italian Conference on Computational Linguistics, pages 166–170, Trento, Italy.
Tobias Horsmann and Torsten Zesch. 2016. LTL-UDE @ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text. In Proceedings of the 10th Web as Corpus Workshop, pages 120–126, Berlin, Germany.

John D Lafferty, Andrew McCallum, and Fernando C N Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, San Francisco, CA, USA.

Olutobi Owoputi, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named Entity Recognition in Tweets: An Experimental Study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524–1534, Stroudsburg, PA, USA.

Helmut Schmid. 1995. Improvements In Part-of-Speech Tagging With an Application To German. In Proceedings of the ACL SIGDAT-Workshop, pages 47–50.

Torsten Zesch and Tobias Horsmann. 2016. FlexTag: A Highly Flexible PoS Tagging Framework. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 4259–4263, Portorož, Slovenia.