=Paper=
{{Paper
|id=Vol-1086/06-paper
|storemode=property
|title=unimelb: Spanish Text Normalisation
|pdfUrl=https://ceur-ws.org/Vol-1086/paper06.pdf
|volume=Vol-1086
|dblpUrl=https://dblp.org/rec/conf/sepln/HanCB13
}}
==unimelb: Spanish Text Normalisation==
Bo Han,¹,² Paul Cook¹ and Timothy Baldwin¹,²
¹ Department of Computing and Information Systems, The University of Melbourne
² NICTA Victoria Research Lab
hanb@student.unimelb.edu.au, paulcook@unimelb.edu.au, tb@ldwin.net
Abstract: This paper describes a lexicon-based text normalisation approach for Spanish tweets. We first compare English and Spanish text normalisation, and hypothesise that an approach previously proposed for English can be adapted to Spanish. A corpus-derived normalisation lexicon is built using distributional similarity, and is combined with existing lexicons (e.g., containing Spanish Internet slang). These lexicons enable a very fast, look-up based approach to text normalisation. Experimental results indicate that the corpus-derived lexicon complements existing lexicons, but that the approach could be improved through better handling of certain word types, such as named entities.

Keywords: Twitter, Spanish, Text Normalisation
1 Introduction

A tremendous amount of user-generated text is produced on social media sites such as Twitter and Facebook, and can be leveraged for natural language processing (NLP) tasks such as sentiment analysis (Jiang et al., 2011) and event detection (Weng and Lee, 2011). However, this user-generated text is noisy, and contains various non-standard words, e.g., jajaja (“ja”) and queee (“que”). These non-standard words are not recognised by off-the-shelf NLP tools, and may consequently degrade the utility of NLP on social media. One way to tackle this challenge is text normalisation — restoring these non-standard words to their canonical forms, e.g., transforming jajaja to “ja” and queee to “que” (Eisenstein, 2013; Han, Cook, and Baldwin, 2013).

This paper proposes a lexicon-based approach to Spanish text normalisation. In particular, we adapt the method of Han, Cook, and Baldwin (2012) to build a normalisation lexicon that maps non-standard words to their standard forms relative to a vocabulary, i.e., out-of-vocabulary (OOV) words are mapped deterministically to in-vocabulary (IV) words. This enables a very fast, look-up based approach to text normalisation. In our approach an OOV word is first looked up in an automatically-derived normalisation lexicon that is complemented with entries from Spanish Internet slang dictionaries and the development data. If the OOV word is found in this lexicon it is normalised according to its entry, otherwise it is left unchanged. During this normalisation step, OOV words and the resulting normalisations are down-cased, so a final case restoration step is performed to appropriately capitalise the lowercased normalisations.
2 Comparing English and Spanish Text Normalisation

The lexicon-based normalisation approach of Han, Cook, and Baldwin (2012) was evaluated on English tweets. In this section we consider the plausibility of adapting their method from English to Spanish, and identify the following key factors:

Orthography: if we consider diacriticised letters as single characters, Spanish has more characters than English, and diacritics can lead to differences in meaning, e.g., más means “more”, and mas means “but”. The method of Han, Cook, and Baldwin (2012) uses Levenshtein distance to measure string similarity. We simply convert all characters to fused Unicode code points (treating á and a as different characters) and compute Levenshtein distance over these forms (see the sketch after this list).

Word segmentation: Spanish and English words both largely use whitespace segmentation, so similar tokenisation strategies can be used.

Morphophonemics: Phonetic modeling of words — a component of the method of Han, Cook, and Baldwin (2012) — is available for Spanish using an off-the-shelf Double Metaphone implementation.¹

¹ https://github.com/amsqr/Spanish-Metaphone

Lexical resources: A lexicon and slang dictionary — key resources for the method of Han, Cook, and Baldwin (2012) — are available for Spanish.
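As an illustration of the orthography point, fusing combining diacritics into single code points can be done with Unicode NFC normalisation before a standard Levenshtein computation; this is a sketch of the idea rather than the authors' exact implementation:

```python
import unicodedata

def levenshtein(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def fused_distance(a, b):
    # NFC composes a base letter plus combining accent into one code
    # point, so á and a compare as distinct single characters.
    return levenshtein(unicodedata.normalize("NFC", a),
                       unicodedata.normalize("NFC", b))

print(fused_distance("mas", "más"))  # -> 1
```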
Overall, English and Spanish text share important features, and we hypothesise that adapting a lexicon-based English normalisation system to Spanish is feasible.

One important component of this Spanish normalisation task is case restoration: e.g., maria as a name should be normalised to “Maria”. Most previous English Twitter normalisation tasks have focused on lowercase words and ignored capitalisation.

3 System Description

The system consists of two steps: (1) down-case all OOVs and normalise them based on a normalisation lexicon which combines entries from existing lexicons (Section 3.1) and entries automatically learnt from a Twitter corpus (Section 3.2); (2) restore case for normalised words (Section 3.3).
3.1 Resources

Our normalisation transforms OOV forms to IV words, and thus a Spanish lexicon is required to determine what is OOV. To this end, we use the Freeling 3.0 Spanish dictionary (Padró and Stanilovsky, 2012) which contains 669k words.

We collected 146 Spanish Internet slang expressions and cell phone abbreviations from the web (Slang Lexicon).² We further extracted normalisation pairs from the development data (Dev Lexicon).

² http://goo.gl/wgCFSs and http://goo.gl/xsYkDe, both accessed on 26/06/2013

Analysing the development data we noticed that many person names are not correctly capitalised. We formed Name Lexicon from a list of 277 common Spanish names.³ This lexicon maps lowercase person names to their correctly capitalised forms.

³ https://en.wikipedia.org/wiki/Spanish_naming_customs
3.2 Corpus-derived Lexicon

The small, manually-crafted normalisation lexicons from Section 3.1 have low coverage over non-standard words. To improve coverage, we automatically derive a much larger normalisation lexicon based on distributional similarity (Dist Lexicon) by adapting the method of Han, Cook, and Baldwin (2012).

We collected 283 million Spanish tweets via the Twitter Streaming API⁴ from 21/09/2011–28/02/2012. Spanish tweets were identified using langid.py (Lui and Baldwin, 2012). The tweets were tokenised using a simplified English Twitter tokeniser (O’Connor, Krieger, and Ahn, 2010). Excessive repetitions of characters (i.e., ≥ 3) in words are shortened to one character to ensure different variations of the same pattern are merged. To improve coverage, we removed the restriction from the original work that only OOVs with ≥ 4 letters were considered as candidates for normalisation.

⁴ https://dev.twitter.com
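The repetition-shortening step can be expressed as a single regular expression; this is a sketch of that preprocessing, not the authors' released code:

```python
import re

def shorten_repetitions(token):
    # Collapse any run of 3 or more identical characters to a single
    # character, so e.g. queee, queeee and queeeee all become que.
    return re.sub(r'(.)\1{2,}', r'\1', token)

print(shorten_repetitions("queee"))   # -> que
print(shorten_repetitions("jaaaaa"))  # -> ja
```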
For a given OOV, we define its confusion set to be all IV words with Levenshtein distance ≤ 2 in terms of characters or ≤ 1 in terms of Double Metaphone code. We rank the items in the confusion set according to their distributional similarity to the OOV. Han, Cook, and Baldwin (2012) considered many configurations of distributional similarity for normalisation of English tweets. We use the same settings they selected: context is represented by positionally-indexed bigrams using a window size of ±2 tokens; similarity is measured using KL divergence. An entry in the normalisation dictionary then consists of the OOV and its top-ranked IV.
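A sketch of the confusion-set construction under these definitions is given below; `double_metaphone` stands in for an off-the-shelf implementation such as the Spanish-Metaphone package cited above, and levenshtein() is reused from the earlier sketch:

```python
# Sketch: build the confusion set for one OOV word. `iv_words` is the
# IV vocabulary; `double_metaphone(word)` is assumed to return the
# phonetic code of a word, and levenshtein() is defined above.

def confusion_set(oov, iv_words, double_metaphone):
    oov_code = double_metaphone(oov)
    candidates = set()
    for iv in iv_words:
        # Character-based Levenshtein distance <= 2, or Double
        # Metaphone code distance <= 1.
        if (levenshtein(oov, iv) <= 2 or
                levenshtein(oov_code, double_metaphone(iv)) <= 1):
            candidates.add(iv)
    return candidates
```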
From development data, we observe that in many cases when a correct normalisation is identified, there is a large difference in KL divergence between the first- and second-ranked IVs. Conversely, if the KL divergence of the first- and second-ranked normalisation candidates is similar, the normalisation is often less reliable. As shown in Table 1, only callendo (“cayendo”) is a correctly-derived (OOV, IV) pair, but guau (“y”) is not.

Rank  callendo          guau
1     cayendo   0.713   y    1.756
2     saliendo  3.896   que  1.873
3     fallando  4.303   la   2.488
4     rallando  6.761   a    2.649
5     valiendo  6.878   no   3.206

Table 1: The KL divergence for the top-five candidates for callendo and guau.

Motivated by this observation, we filter the derived (OOV, IV) pairs by the KL divergence ratio of the first- and second-ranked IV words for the OOV. Setting a high threshold on this KL divergence ratio increases the reliability of the derived lexicon, but reduces its coverage. This ratio was tested for values from 1.0 to 3.0 with a step size of 0.1 over the development data and Slang Lexicon. As shown in Figure 1, the best precision (94.0%) is achieved when the ratio is 1.9.⁵ We directly use this setting to derive the final lexicon, instead of further re-ranking the (OOV, IV) pairs using string similarity.

[Figure 1: KL divergence ratio cut-off vs. precision of the derived normalisation lexicon on the development data and Slang Lexicon.]

⁵ Here precision is defined as #correct normalisations / #normalisations.
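The filtering criterion can be stated compactly: a derived pair is kept only when the KL divergence of the second-ranked candidate divided by that of the first-ranked candidate reaches the threshold. A sketch, assuming `ranked` is a list of (IV word, KL divergence) pairs sorted by ascending divergence:

```python
# Sketch: keep a derived (OOV, IV) pair only when the top candidate is
# clearly separated from the runner-up. Lower KL divergence means the
# candidate is more distributionally similar to the OOV.

KL_RATIO_THRESHOLD = 1.9  # best precision on development data (Figure 1)

def derive_entry(oov, ranked):
    if len(ranked) < 2:
        return None
    (best_iv, best_kl), (_, second_kl) = ranked[0], ranked[1]
    if second_kl / best_kl >= KL_RATIO_THRESHOLD:
        return (oov, best_iv)
    return None  # ambiguous: the pair is filtered out

# With the Table 1 values, callendo is kept (3.896/0.713 > 1.9)
# but guau is filtered out (1.873/1.756 < 1.9).
print(derive_entry("callendo", [("cayendo", 0.713), ("saliendo", 3.896)]))
print(derive_entry("guau", [("y", 1.756), ("que", 1.873)]))
```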
3.3 Case Restoration

We set the case of each token that was normalised in the previous step (which is down-cased at the current stage) to its most-frequent casing in our corpus of Spanish tweets. We also capitalise all normalised tokens occurring at the beginning of a tweet, or following a period or question mark.
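A minimal sketch of this restoration step, assuming a precomputed mapping `most_frequent_casing` from lowercased forms to their most frequent surface form in the tweet corpus (the mapping name and the simple boundary test are illustrative):

```python
# Sketch of case restoration. `most_frequent_casing` maps a lowercased
# token to its most frequent casing in the tweet corpus, e.g.
# {"maria": "Maria", "españa": "España"}.

def restore_case(tokens, normalised_flags, most_frequent_casing):
    restored = []
    sentence_start = True  # tweet-initial position counts as a boundary
    for token, was_normalised in zip(tokens, normalised_flags):
        if was_normalised:
            token = most_frequent_casing.get(token, token)
            if sentence_start and token:
                token = token[0].upper() + token[1:]
        restored.append(token)
        sentence_start = token.endswith((".", "?"))
    return restored

casing = {"maria": "Maria"}
print(restore_case(["maria", "es", "genial."], [True, False, False], casing))
# -> ['Maria', 'es', 'genial.']
```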
4 Results and Discussion

We evaluated the lexicons using classification accuracy, the official metric for this shared task, on the tweet-norm test data. This metric divides the number of correct proposals — OOVs correctly normalised or left unchanged — by the number of OOVs in the collection. This is termed “precision” by the task organisers, but a true measure of precision would be based on the number of OOVs that were actually normalised. We therefore use the term “accuracy” here.

We submitted two runs for the task. The first, Combined Lexicon (Table 2), uses the combination of lexicons from Section 3, and achieves an accuracy of 0.52. The second run builds on Combined Lexicon but incorporates normalisation based on character edit distance for words with many repeated characters. We observed that such words are often non-standard, and tend not to occur in the lexicons because of their relatively low frequency. For words with ≥ 3 repeated characters, we remove all but one of the repeated characters, and then select the most similar IV word according to character-based Levenshtein distance. The accuracy of this run is 0.54 (+ Edit distance, Table 2).

Lexicon            Accuracy
Combined Lexicon   0.52
− Slang Lexicon    0.51
− Dev Lexicon      0.46
− Dist Lexicon     0.42
− Name Lexicon     0.51
+ Edit distance    0.54
Baseline           0.20

Table 2: Accuracy of lexicon-based normalisation systems. “−” indicates the removal of a particular lexicon.
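A sketch of the edit-distance fallback used in the second run, reusing shorten_repetitions() and levenshtein() from the earlier sketches; the applicability test and minimum-distance tie-breaking are our illustrative choices:

```python
import re

def edit_distance_fallback(token, iv_words):
    # Only applies to words with a run of 3 or more repeated characters.
    if not re.search(r'(.)\1{2,}', token):
        return token
    collapsed = shorten_repetitions(token)
    # Pick the IV word closest to the collapsed form by character-based
    # Levenshtein distance.
    return min(iv_words, key=lambda iv: levenshtein(collapsed, iv))
```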
We further consider an ablative analysis of the component lexicons of Combined Lexicon. As shown in Table 2, when Slang Lexicon (− Slang Lexicon) or Name Lexicon (− Name Lexicon) are excluded, accuracy declines only slightly. Although this suggests that existing resources play only a minor role in the normalisation of Spanish tweets, this is likely due in part to the relatively small size of Slang Lexicon, which is much smaller than similar English resources that have been effectively exploited in normalisation — i.e., 145 Spanish entries versus 5k English entries used by Han and Baldwin (2011). Furthermore, Slang Lexicon might have little impact due to differences between Spanish Twitter and SMS, the latter being the primary focus of Slang Lexicon.

On the other hand, normalisation lexicons derived from tweets — whether based on the development data (Dev Lexicon) or automatically learnt (Dist Lexicon) — substantially impact accuracy (− Dev Lexicon and − Dist Lexicon). These findings for the automatically derived Dist Lexicon are in line with previous findings for English Twitter normalisation (Han, Cook, and Baldwin, 2012) that indicate that such lexicons can substantially improve recall with little impact on precision.

We considered an experiment in which we used Combined Lexicon, but ignored case in the evaluation; the accuracy was 0.56. This corresponds to the upper bound on accuracy if our system performed case restoration perfectly, and suggests that improving the case restoration of our system would not lead to substantial gains in accuracy.

In the final row of Table 2 we show results for a baseline method which makes no attempt to normalise the input. All lexicon-based methods improve substantially over this baseline.
To further analyse our lexicon-based normalisation approach, we categorise the errors for both false positives (OOVs that were normalised, but incorrectly so) and false negatives (OOVs that were not normalised, but should have been). As shown in Table 3, 37% of false positives are incorrect lexical forms, e.g., algerooo is normalised to “algero” and not its correct form “alegra”. Further examination shows that 23% of these cases are incorrectly normalised to “que”, suggesting that distributional similarity alone is insufficient to capture normalisations for some non-standard words.

Error type              Number  Percentage
Incorrect lexical form  22      37%
Not available           19      32%
Accent error            10      17%
Case error              5       8%
One to many             2       3%
Annotation error        1       2%

Table 3: Categorisation of false positives.

Surprisingly, we found some OOVs included in the test data, but excluded from the gold-standard annotations (due to tweet deletions), or present in the test data, but not found in the tweets, and excluded in the gold standard. These error types are denoted as “Not available” in Table 3, and account for the second largest source of false positives.

Incorrect accents and casing account for 17% and 8% of false positives, respectively. In both of these cases, contextual information, which is not incorporated in the proposed approach, could be helpful. Finally, we identified two one-to-many normalisations (which are outside the scope of our normalisation system), and one case we judged to be an annotation error.
We analysed a random sample of 20 of the 280 false negatives, and found irregular character repetitions and named entities to be the main sources of errors, e.g., uajajajaa (“ja”) and Pedroo (“Pedro”).⁶ The lexicon-based approach could be improved, for example, by using additional regular expressions to capture repetitions of character sequences (sketched below). Errors involving named entities reveal the limitations of using the Freeling 3.0 Spanish dictionary as the IV lexicon, as it has limited coverage of named entities. A corpus-derived lexicon (e.g., from Wikipedia) could help improve the coverage.

⁶ Pedro is not in our collected list of Spanish names.
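For instance, a regular expression over repeated multi-character sequences (rather than single characters) would reduce cases like uajajajaa to a form the lexicons can match; this is an illustrative sketch, not part of the submitted system:

```python
import re

def collapse_repeated_sequences(token):
    # Collapse a repeated multi-character sequence to a single copy,
    # e.g. "jajaja" -> "ja"; single-character runs are handled too.
    # The collapsed form can then be matched against the normalisation
    # lexicon or IV vocabulary.
    return re.sub(r'(.+?)\1+', r'\1', token)

print(collapse_repeated_sequences("jajaja"))     # -> ja
print(collapse_repeated_sequences("uajajajaa"))  # -> uaja
```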
5 Summary

In this paper, we applied a lexicon-based approach to normalise non-standard words in Spanish tweets. Our analysis suggests that the corpus-derived lexicon based on distributional similarity improves accuracy, but that this approach is limited in terms of flexibility (e.g., to capture accent variation) and lexicon coverage (e.g., of named entities). In future work, we plan to expand the IV lexicon, and incorporate contextual information to improve normalisation involving accents and casing.
Acknowledgements

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy, and the Australian Research Council through the ICT Centre of Excellence programme. The authors would like to thank the anonymous reviewers for their valuable feedback and language expertise.

References

Eisenstein, Jacob. 2013. What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2013), pages 359–369, Atlanta, USA.

Han, Bo and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT 2011), pages 368–378, Portland, Oregon, USA.

Han, Bo, Paul Cook, and Timothy Baldwin. 2012. Automatically constructing a normalisation dictionary for microblogs. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 421–432, Jeju Island, Korea. Association for Computational Linguistics.

Han, Bo, Paul Cook, and Timothy Baldwin. 2013. Lexical normalisation of short text messages. ACM Transactions on Intelligent Systems and Technology, 4(1):5:1–5:27.

Jiang, Long, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. 2011. Target-dependent Twitter sentiment classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT 2011), pages 151–160, Portland, Oregon, USA.

Lui, Marco and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012) Demo Session, pages 25–30, Jeju, Republic of Korea.

O’Connor, Brendan, Michel Krieger, and David Ahn. 2010. TweetMotif: Exploratory search and topic summarization for Twitter. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, pages 384–385, Washington, USA.

Padró, Lluís and Evgeny Stanilovsky. 2012. Freeling 3.0: Towards wider multilinguality. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 2473–2479, Istanbul, Turkey.

Weng, Jianshu and Bu-Sung Lee. 2011. Event detection in Twitter. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, Barcelona, Spain.