=Paper=
{{Paper
|id=Vol-1749/paper_016
|storemode=property
|title=Building a Social Media Adapted PoS Tagger Using FlexTag – A Case Study on Italian Tweets
|pdfUrl=https://ceur-ws.org/Vol-1749/paper_016.pdf
|volume=Vol-1749
|authors=Tobias Horsmann,Torsten Zesch
|dblpUrl=https://dblp.org/rec/conf/clic-it/HorsmannZ16
}}
==Building a Social Media Adapted PoS Tagger Using FlexTag – A Case Study on Italian Tweets==
Tobias Horsmann Torsten Zesch
Language Technology Lab
Department of Computer Science and Applied Cognitive Science
University of Duisburg-Essen, Germany
{tobias.horsmann,torsten.zesch}@uni-due.de
Abstract

We present a detailed description of our submission to the PoSTWITA shared task for PoS tagging of Italian social media text. We train a model based on FlexTag using only the provided training data and external resources like word clusters and a PoS dictionary which are built from publicly available Italian corpora. We find that this minimal adaptation strategy, which already worked well for German social media data, is also highly effective for Italian.

1 Introduction

In this paper, we describe our submission to the PoSTWITA Shared Task 2016 that aims at building accurate PoS tagging models for Italian Twitter messages. We rely on FlexTag (Zesch and Horsmann, 2016), a flexible, general-purpose PoS tagging architecture that can be easily adapted to new domains and languages. We re-use the configuration from Horsmann and Zesch (2015) that has been shown to be most effective for adapting a tagger to the social media domain. Besides training on the provided annotated data, it mainly relies on external resources like PoS dictionaries and word clusters that can be easily created from publicly available Italian corpora. The same configuration has been successfully applied for adapting FlexTag to German social media text (Horsmann and Zesch, 2016).

2 Experimental Setup

We use the FlexTag CRF classifier (Lafferty et al., 2001) with a context window of ±1 tokens, the 750 most frequent character n-grams over all bi-, tri-, and four-grams, and boolean features indicating whether a token contains a hyphen, period, comma, bracket, underscore, or number. We furthermore use boolean features capturing whether a token is fully capitalized, a retweet, a URL, a user mention, or a hashtag.

Data: We train our tagging model only on the annotated data provided by the shared task organizers. As this training set is relatively large, we decided against adding annotated data from foreign domains, which is a common strategy to offset small in-domain training sets (Ritter et al., 2011; Horsmann and Zesch, 2016).

Resources: Word clusters: We create word clusters using Brown clustering (Brown et al., 1992) from 400 million tokens of Italian Twitter messages crawled between 2011 and 2016.

PoS dictionary: We create a PoS dictionary which stores the three most frequent PoS tags of a word. We build the dictionary using a PoS-annotated Italian Wikipedia corpus (http://wacky.sslmit.unibo.it/doku.php?id=corpora).

Namelists: We furthermore use lists of first names obtained from Wikipedia and extract words tagged as named entities from the ItWaC web corpus (Baroni et al., 2009) to improve coverage of named entities.
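The token-level feature set described in Section 2 can be sketched as follows. This is illustrative only: FlexTag's actual feature extractors and feature names differ, and the restriction to the 750 most frequent character n-grams over the training data is omitted here.

```python
def token_features(tokens, i):
    """Sketch of the orthographic features from Section 2 for token i.

    Illustrative only: names are hypothetical and the frequency-based
    n-gram filtering is left out."""
    tok = tokens[i]
    feats = {
        # context window of +/-1 tokens
        "token": tok.lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        # boolean surface-form features
        "has_hyphen": "-" in tok,
        "has_period": "." in tok,
        "has_comma": "," in tok,
        "has_bracket": any(c in tok for c in "()[]{}"),
        "has_underscore": "_" in tok,
        "has_number": any(c.isdigit() for c in tok),
        "all_caps": tok.isupper() and tok.isalpha(),
        # social-media-specific features
        "is_retweet": tok == "RT",
        "is_url": tok.startswith(("http://", "https://", "www.")),
        "is_mention": tok.startswith("@"),
        "is_hashtag": tok.startswith("#"),
    }
    # character bi-, tri-, and four-grams of the token
    for n in (2, 3, 4):
        for j in range(len(tok) - n + 1):
            feats["char=" + tok[j:j + n]] = True
    return feats

feats = token_features(["RT", "@utente", "che", "giornata", "!"], 1)
```

Per-token feature dictionaries of this shape are exactly what common CRF toolkits consume, with the cluster-ID and dictionary-tag lookups from the external resources added as further entries.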
                     Acc All  Acc OOV
TreeTagger Baseline     75.5        -
PoSTWITA                90.6     80.5
 + Clusters             92.7     85.6
 + PoS-Dict             92.2     85.3
 + Namelist             91.1     81.4
 + All Resources        92.9     86.2

Table 1: Results on the test data set

Baseline System: We compare our results to the Italian model of TreeTagger (Schmid, 1995). As TreeTagger uses a much more fine-grained tagset than the one used in this shared task, we map its fine-grained tags to the shared-task tagset using the mapping provided by DKPro Core (Eckart de Castilho and Gurevych, 2014).

3 Results

Table 1 gives an overview of our results. Besides the baseline, we show the results for only using the available training data (labeled PoSTWITA) and when adding the different types of external resources.

Tag          #    Acc  Primary Confusion
ADP A      145  100.0  -
HASHTAG    115  100.0  -
MENTION    186  100.0  -
PUNCT      583  100.0  -
CONJ       123   99.2  VERB
URL        119   98.3  VERB
DET        306   95.8  PRON
ADP        351   95.7  ADV
PRON       327   93.3  DET
NUM         70   92.9  ADJ
INTJ        66   92.4  NOUN
NOUN       607   91.6  PROPN
VERB       568   91.6  AUX
AUX        109   90.8  VERB
ADV        321   90.3  SCONJ
SCONJ       60   90.0  PRON
ADJ        210   86.2  NOUN
EMO         79   83.5  SYM
PROPN      346   79.5  NOUN
VERB CLIT   27   77.8  NOUN
SYM         12   72.7  PUNCT
X           27   55.6  EMO

Table 2: Accuracy per word class on the test data
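The binomial normal-approximation confidence interval used below for the best result in Table 1 can be reproduced with a few lines; a minimal sketch, assuming a test set of 4,757 tokens (the sum of the per-class token counts in Table 2):

```python
import math

def binomial_ci(acc, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for an accuracy
    measured on n tokens; z = 1.96 corresponds to alpha = 0.05."""
    se = math.sqrt(acc * (1.0 - acc) / n)
    return acc - z * se, acc + z * se

# 92.9% accuracy; test-set size of 4,757 is an assumption (sum of Table 2 counts)
low, high = binomial_ci(0.929, 4757)
print(f"{low:.3f} .. {high:.3f}")  # prints "0.922 .. 0.936"
```

Any configuration whose accuracy falls outside these bounds differs significantly from the best run under this approximation.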
The baseline is not competitive with any of our system configurations, which confirms the generally poor performance of off-the-shelf PoS taggers on social media text. Using all resources yields our best result of 92.9%. Among the individual resources, word clusters perform best regarding both overall accuracy and accuracy on out-of-vocabulary (OOV) tokens. This shows that clusters are also highly effective for Italian, as was previously shown for English (Owoputi et al., 2013) and German (Horsmann and Zesch, 2016).

We also computed the confidence interval by binomial normal approximation (α = 0.05). We obtain an upper bound of 93.6 and a lower bound of 92.2. This shows that our best configuration is significantly better than using only the provided training data. Looking at the official PoSTWITA results, it also shows that there are no significant differences between the top-ranking systems.

Error Analysis: In Table 2, we show the accuracy for each PoS tag on the test data set. The largest confusion class is between nouns and proper nouns, which is in line with previous findings for other languages (Horsmann and Zesch, 2016). It can be argued whether requiring the PoS tagger to make this kind of distinction is actually a good idea, as it often does not depend on syntactic properties, but on the wider usage context. Because of the high number of noun/proper-noun confusions, it is also likely that improvements for this class will hide improvements on smaller classes that might be more important quality indicators for social media tagging. In our error analysis, we will thus focus on more interesting cases.

In Table 3, we show examples of selected tagging errors. In the case of the two adjective-determiner confusions, both words occurred in the training data, but never as adjectives. The verb examples show cases where incorrectly tagging a verb as an auxiliary leads to a follow-up error. We have to stress here that the feature set we use for training our PoS tagger does not use any linguistic knowledge about Italian. Thus, adding linguistic knowledge might help to better inform the tagger how to avoid such errors.

Adjective Confusions
Token   Gold/Pred   Token     Gold/Pred
cazzo   INTJ        successo  VERB
sono    VERB        dal       ADP A
tutti   DET         quel      ADJ / DET
sti     ADJ / DET   cazzo     NOUN
tweet   NOUN        di        ADP

Verb Confusions
Token          Gold/Pred    Token      Gold/Pred
maggiormente   ADV          è          AUX / VERB
dell'          ADP A        sempre     ADV
essere         VERB / AUX   stata      VERB / AUX
capito         ADJ / VERB   togliersi  VERB CLIT
.              PUNCT        dai        ADP A

Table 3: Adjective and Verb confusions

[Figure 1: Learning curve on the training data with and without resources; accuracy (80-100%) plotted against the percentage of data used for training (10-90%), with one curve per setting (Using Resources, No Resources).]

Amount of Training Data: The amount of annotated social media text (120k tokens) in this
shared task is an order of magnitude larger than what was used in other shared tasks for tagging social media text. This raises the question of how much annotated training data is actually necessary to train a competitive social media PoS tagging model.

In Figure 1, we plot two learning curves that show how accuracy improves with an increasing amount of training data. We split the training data into ten chunks of equal size and add one additional data chunk in each iteration. We show two curves, one for just using the training data and one for additionally using all our resources. When using no resources, we see a rather steep and continuous increase of the learning curve, which shows how challenging it is to provide sufficient training data for this domain. Using resources, this need for training data is compensated, and only a small amount of training data is required to train a good model. The curves also show that the remaining problems are certainly not going to be solved by providing more training data.

4 Summary

We presented our contribution to the PoSTWITA shared task 2016 for PoS tagging of Italian social media text. We show that the same adaptation strategies that have been applied for English and German also lead to competitive results for Italian. Word clusters are the most effective resource and considerably help to reduce the problem of out-of-vocabulary tokens. In a learning curve experiment, we show that adding more annotated data is not likely to provide further improvements, and we recommend instead adding more language-specific knowledge. We make our experiments and resources publicly available (https://github.com/Horsmann/EvalitaPoSTWITA2016.git).

Acknowledgments

This work was supported by the Deutsche Forschungsgemeinschaft (DFG) under grant No. GRK 2167, Research Training Group “User-Centred Social Media”.

References

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Peter F Brown, Peter V DeSouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-Based n-gram Models of Natural Language. Computational Linguistics, 18:467–479.

Richard Eckart de Castilho and Iryna Gurevych. 2014. A broad-coverage collection of portable NLP components for building shareable analysis pipelines. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT (OIAF4HLT) at COLING 2014, pages 1–11, Dublin, Ireland.

Tobias Horsmann and Torsten Zesch. 2015. Effectiveness of Domain Adaptation Approaches for Social Media PoS Tagging. In Proceedings of the 2nd Italian Conference on Computational Linguistics, pages 166–170, Trento, Italy.
Tobias Horsmann and Torsten Zesch. 2016. LTL-UDE @ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text. In Proceedings of the 10th Web as Corpus Workshop, pages 120–126, Berlin, Germany.

John D Lafferty, Andrew McCallum, and Fernando C N Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, San Francisco, CA, USA.

Olutobi Owoputi, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named Entity Recognition in Tweets: An Experimental Study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524–1534, Stroudsburg, PA, USA.

Helmut Schmid. 1995. Improvements In Part-of-Speech Tagging With an Application To German. In Proceedings of the ACL SIGDAT-Workshop, pages 47–50.

Torsten Zesch and Tobias Horsmann. 2016. FlexTag: A Highly Flexible PoS Tagging Framework. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 4259–4263, Portorož, Slovenia.