=Paper=
{{Paper
|id=Vol-1086/04-paper
|storemode=property
|title=Resource-based Lexical Approach to Tweet-Norm task
|pdfUrl=https://ceur-ws.org/Vol-1086/paper04.pdf
|volume=Vol-1086
|dblpUrl=https://dblp.org/rec/conf/sepln/MoyaCT13
}}
==Resource-based Lexical Approach to Tweet-Norm task==
Resource-based lexical approach to TWEET-NORM task
Aproximación léxica basada en recursos para la tarea TWEET-NORM

Juan M. Cotelo Moya, Fermín L. Cruz, Jose A. Troyano
Universidad de Sevilla
Avda. Reina Mercedes s/n, 41012 Sevilla
jcotelo@us.es, fcruz@us.es, troyano@us.es
Abstract: This paper proposes a resource-based lexical approach for addressing the TWEET-NORM task. The proposed system presents a simple but extensible modular architecture in which each analysis module independently proposes correction candidates for each OOV word. Each of these analysis modules addresses a specific problem and works in a very different way from the others. The resources are the main component of the OOV detection system and also support the validation and filtering of candidates.
Keywords: Twitter, resources, modular architecture, candidates
Resumen: Este artículo propone una aproximación léxica basada en recursos para abordar la tarea TWEET-NORM. El sistema presenta una arquitectura modular sencilla pero extensible en la cual cada módulo de análisis propone candidatos para cada palabra OOV de forma independiente. Cada uno de estos módulos de análisis intenta abordar una problemática específica y cada uno opera de forma muy distinta. Los recursos se usan como base fundamental del sistema de detección de OOVs y como apoyo para la validación y filtrado de candidatos.
Palabras clave: Twitter, recursos, arquitectura modular, candidatos
1 Introduction and objectives

One of the most important challenges facing us today is how to process and analyze the large amount of information on the Internet, and especially on social networking sites like Twitter, where millions of people express ideas and opinions daily on any topic of interest. These texts, called tweets, are characterized by a short length (140 characters) that is very small compared with the size of traditional genres.

Consequently, users of these networks have developed a new form of expression that includes SMS-style abbreviations, lexical variants, letter repetitions, use of emoticons, etc. The result is that current NLP tools can have problems processing and understanding these short and noisy texts unless they are normalized first.

The TWEET-NORM lexical normalization task proposes the automatic cleansing of a given set of tweets by identifying and normalizing abbreviations, words with repeated letters, and in general any out-of-vocabulary (OOV) word, regardless of syntactic or stylistic variants.

Before carrying out any normalization, we characterized the existing phenomena, which makes it easier to address the underlying causes of OOVs. To perform this characterization we used a previously collected dataset composed of 3.1 million tweets related to the 2012 UEFA European Football Championship.

Table 1 shows that characterization and provides examples for each phenomenon. Figure 1 also shows the phenomena ratios in a clearer way. It can be observed that most errors fit into 5 major categories, and that most of them are associated with the fast and informal writing style on Twitter, usually done from a mobile device.

The categories proposed for this task are coarser than our characterization: Orthographic errors, Texting language and Character repetitions fit into the Variation category, Free inflections and correct words fit into the Correct category, and Other language and Ascii art would fit into the NS/NC category.

The system proposed in this paper is based only on lexical approaches and is mainly composed of three types of components:

• Resources: Lexicons and similar language resources, including resources containing specific knowledge of the media used.

• Rules: Rules for handling common phenomena found in this type of media, such as excessive character repetition, acronyms or homophonic errors.

• Lexical distance analysis: Traditional lexical distance analysis for handling common orthographic errors found in this media.

In essence, our system works straightforwardly: it examines each word at the lexical scope, determines whether it is an OOV using the available knowledge, generates possible correction candidates and selects the best one.
Phenomenon            Ratio   Examples
Orthographic errors   28%     sacalo → sácalo, trapirar → transpirar, ...
Texting Language      22%     x2 → por dos, q → que, aro → claro, ...
Character repetition  15%     siiiiiiiii → si, quiiiiieeeeroooo → quiero, ...
Ascii Art             14%     « ¤ oO._.Oo ...
Free inflections       7%     besote, gatino, bonico, ...
Other errors           7%     htt, asdafawecas, engoriles, ...
Other Language         4%     ow, ftw, great, lol, ...
Multiple phenomena     3%     diass → días, artooo → rato, ...

Table 1: Characterization of error phenomena commonly found in Twitter media

Figure 1: Ratio of characterized error phenomena commonly found in Twitter media
2 Architecture and components of the system

The architecture of the system we propose for this task is quite straightforward. It is composed of several main components:

• Preprocessing module
• OOV/IV detection module
• OOV analyzer modules
• Candidate generator module
• Candidate scoring and selection module

Figure 2 shows the whole process described above and how the components are interconnected in a single diagram.

The preprocessing module performs the typical initial processing step of lexical analysis, generating a stream of tokens from the tweets while taking into account elements such as hashtags and usernames, numerals and dates, and preserving emoticons during the splitting.
Figure 2: Architecture and processing steps of the proposed system
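The paper does not include the preprocessing code; purely as an illustration, a minimal tokenizer along these lines would keep usernames, hashtags, numbers, dates and emoticons intact (the token patterns below are assumptions, not the authors' actual rules):

```python
import re

# Hypothetical token patterns, tried in order: usernames and hashtags first,
# then a few common emoticons, dates, numbers, plain words and leftover symbols.
TOKEN_RE = re.compile(r"""
    @\w+                                # usernames
  | \#\w+                               # hashtags
  | [:;=8][\-o\*']?[\)\]\(\[dDpP]       # simple western emoticons, e.g. :-) ;D
  | \d{1,2}/\d{1,2}/\d{2,4}             # dates such as 24/06/2012
  | \d+(?:[.,]\d+)?                     # numbers
  | \w+                                 # plain words (candidate IV/OOV tokens)
  | \S                                  # any other single non-space symbol
""", re.VERBOSE | re.UNICODE)

def tokenize(tweet):
    """Split a tweet into tokens, preserving Twitter-specific elements."""
    return TOKEN_RE.findall(tweet)

print(tokenize("@amigo q riiiico el partido!! #Euro2012 :-) 24/06/2012"))
# ['@amigo', 'q', 'riiiico', 'el', 'partido', '!', '!', '#Euro2012', ':-)', '24/06/2012']
```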
The detection module determines whether a token is an OOV or not. It performs that detection using the resources, checking whether the token belongs to any of them. We used a set of lexicons, each one providing known forms used in Twitter, Spanish word forms, well-known emoticons or even colloquial inflections.
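A minimal sketch of such a resource-based detector is shown below; the lexicon file names, the lowercasing and the treatment of usernames, hashtags and numbers are illustrative assumptions rather than the actual implementation:

```python
# Hypothetical lexicon-based IV/OOV detector: a token is in-vocabulary (IV)
# if it appears in any of the loaded lexicons, otherwise it is an OOV.

def load_lexicon(path):
    """Load a raw-text lexicon with one entry per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

class OOVDetector:
    def __init__(self, lexicon_paths):
        # e.g. ["spanish.txt", "genre.txt", "emoticons.txt"] (hypothetical names)
        self.lexicons = [load_lexicon(p) for p in lexicon_paths]

    def is_oov(self, token):
        # Usernames, hashtags and numbers are not treated as OOVs here.
        if token.startswith(("@", "#")) or token.isdigit():
            return False
        return not any(token.lower() in lex for lex in self.lexicons)
```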
Given an OOV token, each analyzer module performs some error-guessing process and tries to estimate corrections for it. The specific process varies for each analyzer. Every analyzer provides some kind of basic score expressing a degree of confidence for each proposed correction. The analyzers used for this task were the following:

• Transformation rules: This analyzer holds a collection of hand-crafted rules, each one representing some kind of well-defined error, and transforms the token into a correction candidate. It is possible to generate more than one candidate when multiple rules match, but the number is usually limited to a few. These rules are intended to address phenomena that the edit distance module does not handle correctly, like Character repetition or Texting language. Table 2 shows some example rules, using Python's regular expressions.

• Edit distance: This module works very similarly to the distance-based suggestion scheme commonly found in spell checkers. The main difference is that it takes into account multiple lexicons instead of a monolithic one.

• Language: This module tries to identify which language the OOV token actually belongs to.

Notice that the language analyzer module does not actually perform any correction: if the token comes from another language, it only has to be marked, not corrected. This module uses a trigram language guessing module, in its Python 3 implementation (Phi-Long, 2012), as backend.

The candidate generation module asks each analyzer for candidates and performs a validation and filtering step, removing some incorrectly generated candidates coming from the transformation rules according to validation rules and the language resources used. It also removes duplicates.

The candidate selector module applies a normalizing scoring function to the confidence values provided for each candidate, sorts the candidates and selects the best one.
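A possible shape for these two modules is sketched below; the analyzer interface, the validation callback and the normalizing function are assumptions made for illustration, not the system's actual code:

```python
# Hypothetical interface: each analyzer returns (candidate, confidence) pairs
# for an OOV token; the generator validates and deduplicates them, and the
# selector normalizes the scores, sorts them and keeps the best candidate.

def generate_candidates(token, analyzers, is_valid):
    candidates = {}
    for analyzer in analyzers:
        for cand, conf in analyzer(token):
            if not is_valid(cand):
                continue                                  # validation/filtering step
            candidates[cand] = max(conf, candidates.get(cand, 0.0))  # deduplication
    return candidates

def select_best(candidates):
    if not candidates:
        return None
    total = sum(candidates.values()) or 1.0
    scored = sorted(((conf / total, cand) for cand, conf in candidates.items()),
                    reverse=True)                         # normalize, then sort
    return scored[0][1]
```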
Matching                 Processing         Example                             Phenomenon
[ck]n$                   con                kn, cn → con                        Texting Language
x([aeiouáéíóú])          ch\1               xaval, coxe → chaval, coche         Texting Language
((\w)(\w))\1+(\2|\3)?    \g<1>              sisisisisisi, nonononono → si, no   Character Repetition
t[qk]m+$                 te quiero mucho    tkm, tqm → te quiero mucho          Texting Language

Table 2: Extract of the transformation rules used in our system
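As an illustration of how rules like those in Table 2 could be applied, the following sketch uses Python's re module with the patterns taken from the table; the real rule engine, its 71 rules and its validation step are not shown:

```python
import re

# (pattern, replacement) pairs from Table 2; \g<1> and \1 follow re.sub syntax.
RULES = [
    (re.compile(r"[ck]n$"), "con"),                       # kn, cn -> con
    (re.compile(r"x([aeiouáéíóú])"), r"ch\1"),            # xaval -> chaval
    (re.compile(r"((\w)(\w))\1+(\2|\3)?"), r"\g<1>"),     # sisisisisisi -> si
    (re.compile(r"t[qk]m+$"), "te quiero mucho"),         # tkm -> te quiero mucho
]

def rule_candidates(token):
    """Return one correction candidate per matching transformation rule."""
    candidates = []
    for pattern, repl in RULES:
        if pattern.search(token):
            candidates.append(pattern.sub(repl, token))
    return candidates

print(rule_candidates("xaval"))         # ['chaval']
print(rule_candidates("sisisisisisi"))  # ['si']
print(rule_candidates("tkm"))           # ['te quiero mucho']
```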
The system generates a token stream from the tweet using the preprocessing module. For each token, the detection module determines whether it is an OOV or not. If the token is an IV, no further processing is done because it is already a valid form. Otherwise, the token is an OOV and the candidate generator module creates a tentative list of corrections using the analyzers described above. As a final step, the candidate selector module selects the best candidate for the correction.
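Putting the pieces together, the overall flow could be summarized roughly as follows (an illustrative sketch in which the tokenizer, detector, generator and selector are passed in as plain callables, not the authors' code):

```python
def normalize_tweet(tweet, tokenize, is_oov, generate, select):
    """Return (token, correction) pairs; correction is None for IV tokens."""
    output = []
    for token in tokenize(tweet):
        if not is_oov(token):
            output.append((token, None))          # in-vocabulary: keep as is
        else:
            candidates = generate(token)          # ask analyzers, validate, dedup
            output.append((token, select(candidates)))  # best-scoring candidate
    return output
```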
3 Resources employed

We have used several lexicons for the detection and analysis stages of our system. All of them are in raw text format, with one entry per line. Table 3 shows statistics about all the lexicons used.

Lexicon     Entries   Description
Spanish     1250796   Common forms from Spanish. Based on LibreOffice dictionaries.
Genre       40        Common forms related to Twitter. Handcrafted.
Emoticons   320       Commonly used emoticons. Handcrafted.

Table 3: Lexicons used for our proposed system

For the transformation rule module, we crafted a ruleset of 71 rules. The syntax used in the ruleset file vaguely resembles a CSV format: there is one rule per line, and each line holds information about the matching and the transformation process.
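The exact syntax of the ruleset file is not given in the paper; purely as a hypothetical example, a CSV-like file with one rule per line could be loaded as follows (the field order and the ';' separator are assumptions):

```python
import re

# Hypothetical line format: <matching regex>;<replacement>;<phenomenon label>
# e.g.  x([aeiouáéíóú]);ch\1;Texting Language
def load_ruleset(path):
    rules = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue                              # skip blanks and comments
            pattern, replacement, phenomenon = line.split(";", 2)
            rules.append((re.compile(pattern), replacement, phenomenon))
    return rules
```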
The language detector module uses a trigram character language model implementation as backend and falls back on dictionaries in case of insufficient data for the language estimation.

4 Settings and evaluation

Table 4 shows the performance of the system against the provided corpus with different analyzer modules activated. It can be observed that the accuracy of the system improves significantly as more modules are activated.

However, the performance of our implementation is hindered by large differences between the preprocessing used to build the corpus for this task and our detection and preprocessing system. Our preprocessing module leads to different sets of OOVs being considered, often producing outputs that differ in length from the ones provided with the task. These discrepancies result in a high rate of what the provided test script counts as align errors.

Modules           Accuracy   Align error
Distance module   0.2036     0.2526
Rule module       0.3307     0.3231
Distance + Rule   0.3905     0.2281
Full system       0.5053     0.1684

Table 4: System performance with different modules activated

It is worth mentioning that we used a threshold of k ≤ 2 for the edit distance module because most of the correct candidates fall within that threshold. Although it is true that selecting a higher k includes more candidates, most of the newly included candidates are not valid solutions and will usually receive a low confidence score and not be selected.
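For illustration, a k ≤ 2 candidate search over several lexicons could look like the following sketch, using the standard Levenshtein distance and a naive exhaustive scan (the actual distance measure, pruning and confidence formula of the system are not specified in the paper):

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def distance_candidates(token, lexicons, k=2):
    """Candidates within edit distance k, drawn from multiple lexicons (naive scan)."""
    found = {}
    for lexicon in lexicons:
        for word in lexicon:
            d = levenshtein(token.lower(), word)
            if d <= k:
                found[word] = min(d, found.get(word, k))
    # Lower distance maps to higher confidence, e.g. conf = 1 - d / (k + 1).
    return [(w, 1 - d / (k + 1)) for w, d in found.items()]
```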
5 Conclusions and future work

We provide a resource-aided lexical solution for the proposed task using an extensible architecture made of independent modules. Our system still has much room for improvement, such as adding an n-gram segmenter, including context during the analysis, or using automatic methods for improving the candidate scoring and selection.
References

Han, Bo and Timothy Baldwin. 2011. Lexical normalisation of short text messages: makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 368–378, Stroudsburg, PA, USA. Association for Computational Linguistics.

Han, Bo, Paul Cook, and Timothy Baldwin. 2013. Lexical normalization for social media text. ACM Trans. Intell. Syst. Technol., 4(1):5:1–5:27, February.

Pennell, Deana and Yang Liu. 2011. A character-level machine translation approach for normalization of sms abbreviations. In IJCNLP, pages 974–982.

Phi-Long. 2012. Python 3.3+ implementation of the language guessing module made by Jacob R. Rideout for KDE.

Xue, Zhenzhen, Dawei Yin, and Brian D. Davison. 2011. Normalizing microtext. In Analyzing Microtext.