Weighted and unweighted transducers for tweet normalization

Mans Hulden
University of Helsinki
mans.hulden@helsinki.fi

Jerid Francom
Wake Forest University
francojc@wfu.edu

Resumen: En este artículo se presentan dos estrategias basadas en transductores de estados finitos para la normalización de tweets. La primera de ellas se basa en reglas de corrección creadas manualmente y diseñadas para capturar las erratas y abreviaturas utilizadas más comúnmente, mientras que la segunda intenta construir automáticamente un modelo de errores a partir de un corpus etiquetado (gold standard) de tweets previamente normalizados.
Palabras clave: Normalización de tweets, transductores de estados finitos, reglas fonológicas, canal con ruido.

Abstract: We present two simple finite-state transducer based strategies for tweet normalization. One relies on hand-written correction rules designed to capture commonly occurring misspellings and abbreviations, while the other tries to automatically induce an error model from a gold standard corpus of normalized tweets.
Keywords: Tweet normalization, finite-state transducers, phonological rules, noisy channel

1 Introduction

We present two different finite-state transducer (FST) based strategies for the correction and normalization of tweets. While both methods use FSTs in the recovery of normalized forms of words, the two employ different strategies for normalization. The first method is entirely rule-based; that is, a linguist provides rules for the correction and normalization of tweets based on observations of commonly recurring patterns of abbreviations or misspellings. This is accomplished by compiling correction rules into unweighted FSTs in a hierarchical pattern where simple corrections are attempted before more complex ones. The second method attempts to induce the normalization patterns directly from development data by assuming a noisy channel model. Here, an error model of character-to-character transformations is learned from the data and, combined with a language model also represented as a transducer, the most probable correction can be calculated.

2 Tools

We have employed two different toolkits in our correction strategy. The unweighted transducers for the manually designed correction rules were compiled using the foma toolkit¹ (Hulden, 2009). For manipulating the weighted transducers employed in the noisy channel model approach, the Kleene² (Beesley, 2009) finite-state programming language was used. Both approaches rely on a dictionary taken from the FreeLing suite (Carreras et al., 2004).

¹ http://foma.googlecode.com
² http://kleene-lang.org

3 Correction rules as FSTs

The goal of the handwritten FST correction rules is to attempt a battery of corrections on unknown input words in order to transform them into normalized forms found in a dictionary. The corrections are expressed as contextually conditioned phonological replacement rules, converted into transducers, and combined in a hierarchical fashion in conjunction with a dictionary. They reflect phenomena such as:

• diacritic restoration (a → á, ny → ñ, ...)
• commonly occurring spelling errors (v → b, d → ∅, ...)
• repetition removal (aaa... → a)
• common abbreviations/errors (ke → qué/que, tqm/tkm → te quiero mucho)

The correction rules themselves range from the very specific to the highly generic. For example, a simple rule that attempts to restore missing d-characters in past participles looks as follows in foma notation:

define RULEPart [..] (->) d || [a|i] _ o (s) .#. ;

reflecting the idea that d-characters are to be inserted when preceded by a or i and followed by o, and optionally s, at the end of a word.

At the same time, a very specific rule that only addresses the high-frequency word form Twitter and its various misspellings and alternative forms appears as

define RuleTW1 [[t|T]+ [w|u]+ Vow+ t+ Vow+/h (r)]: {Twitter};

in effect accepting forms such as Tuiter, tuiitah, twittr, etc., and mapping all of these to Twitter.

In total, about 20 rules were designed. Some rules count as single rules in the combination hierarchy: for example, the diacritic restoration rules (a → á, e → é, etc.) are considered a single rule for the purposes of combination.

The rules could also be contextually conditioned outside the current word, which may be useful for performing corrections based on syntactic context. However, for the experiments conducted here, all rules refer only to contexts within the same word.

The rules are combined in a hierarchical order using a type of if-then-else logic. That is, we can, for example, first apply corrections to (1) known abbreviations. If that fails to produce a legitimate word in the dictionary, we (2) apply diacritic restoration. If that fails, we apply some common error pattern (3). Should that fail, we apply both (2) and (3), and so on. This type of hierarchical replacement strategy can be implemented through standard regular expression operators available in conjunction with a dictionary encoded as an automaton (Beesley and Karttunen, 2003).
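To make the if-then-else hierarchy concrete, here is a minimal procedural sketch in Python of the same priority logic. It is only an illustration under assumed data, not the foma implementation used here: the dictionary, the three toy correction steps, and all function names are hypothetical, and rule combinations are only tried up to depth two.

import itertools
import re

DICTIONARY = {"qué", "que", "adiós", "pasado", "quiero", "casa", "cansado"}

def expand_abbreviations(word):
    # known abbreviations (ke -> qué/que, tqm -> te quiero mucho, ...)
    table = {"ke": ["qué", "que"], "tqm": ["te quiero mucho"]}
    yield from table.get(word, [])

def restore_diacritics(word):
    # diacritic restoration (a -> á, e -> é, ...): propose every form
    # obtainable by accenting a single vowel
    accents = {"a": "á", "e": "é", "i": "í", "o": "ó", "u": "ú"}
    for i, ch in enumerate(word):
        if ch in accents:
            yield word[:i] + accents[ch] + word[i + 1:]

def remove_repetitions(word):
    # repetition removal (aaa... -> a)
    yield re.sub(r"(.)\1+", r"\1", word)

# rules ordered from cheap to more drastic, as in the rule hierarchy
RULES = [expand_abbreviations, restore_diacritics, remove_repetitions]

def correct(word):
    if word in DICTIONARY:
        return word
    # try single rules first, then sequences of two different rules
    for depth in (1, 2):
        for combo in itertools.permutations(RULES, depth):
            candidates = [word]
            for rule in combo:
                candidates = [c for w in candidates for c in rule(w)]
            for cand in candidates:
                if cand in DICTIONARY:
                    return cand
    return word   # no correction found: leave the word unchanged

print(correct("ke"))         # -> qué
print(correct("cansaado"))   # -> cansado

The point mirrored from the FST setup is the priority ordering: a candidate produced by an earlier, simpler step is accepted as soon as it is found in the dictionary, before more drastic combinations of steps are attempted.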
4 Noisy channel FST modeling

The fundamental idea behind noisy channel modeling is to choose the w that maximizes the probability P(w|s), where s is the "scrambled" word and w the normalized, intended form. This quantity is proportional to P(s|w)P(w), where the first factor is called the error model (the probability that a word w was perturbed to yield s) and the second factor the language model (the probability that w occurs).

Weighted FSTs provide a convenient calculus for addressing the above task, assuming we can somehow encode the probabilities of words P(w) and of perturbations P(s|w) as probabilistic or weighted transducers. When the error model (EM), the language model (LM), and a scrambled input word s are available to us as weighted transducers, we may calculate the composition

    W = s ◦ EM ◦ LM                                      (1)

and subsequently find the least-cost string in the output projection of W. Here, the transducer s is simply an acceptor that repeats the to-be-corrected input word with probability 1 (or cost 0, as we are operating with negative log probabilities, or weights). Since transducer composition, projection, and least-cost paths are easily calculated, what remains to be done is to construct the transducers that assign a cost (probability) to character perturbations (EM) and to words (LM).
4.1 Inducing an error model

To assign probabilities to perturbations, we first align the individual characters of word pairs in the development corpus using a Chinese Restaurant Process (CRP) (Ferguson, 1973) aligner. This is very similar to aligning the word pairs by minimum edit distance (MED). After the alignment is performed, we work under the assumption that the character-to-character correspondences are all independent, that is, not conditioned upon other transformations in the same word or elsewhere, and estimate the probability of each input symbol x being transformed into y as:

    p(x → y) = (c(x:y) + α) / (N + |Γ|α)                 (2)

using the smoothing parameter α (set to 0.1) to avoid zero counts, N being the number of observed character pairs, and Γ our alphabet of possible symbol pairs. Note that zeroes, representing deletions and insertions, are also possible.

For example, our aligner outputs alignments such as:

    pasa_o    t__o    Tambien    Adioos
    pasado    todo    También   Adió_s

From this, we may calculate the probabilities of inserting an o (_:o), inserting a d (_:d), copying a p (p:p), etc., and construct a transducer representing the error model (EM) that performs such perturbations with the intended probabilities. Note that while we assume a context-independent probability for changes, there is no principled reason why complex contexts could not be accommodated as well.³

³ For example, inserting a d to yield a past participle, as in pasao → pasado, should reasonably be more likely than inserting a d in other contexts.
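As an illustration of equation (2), the following Python sketch (assumed data and names only) collects character-pair counts from alignments in the format shown above, with _ marking an epsilon, and applies the add-α smoothing with α = 0.1. In the actual system the resulting probabilities are encoded as weights on the EM transducer; here they are simply returned by a function, and Γ is restricted to the observed pairs.

import math
from collections import Counter

ALPHA = 0.1   # smoothing parameter alpha, as in the text

# (scrambled, normalized) character alignments of equal length;
# '_' marks an epsilon (insertion/deletion), as in the examples above
aligned_pairs = [
    ("pasa_o", "pasado"),
    ("t__o", "todo"),
    ("Tambien", "También"),
    ("Adioos", "Adió_s"),
]

# c(x:y): counts of observed character-to-character correspondences
counts = Counter()
for scrambled, normalized in aligned_pairs:
    assert len(scrambled) == len(normalized)
    for x, y in zip(scrambled, normalized):
        counts[(x, y)] += 1

N = sum(counts.values())   # N: number of observed character pairs
gamma = set(counts)        # stand-in for Gamma; only the observed
                           # symbol pairs, not the full alphabet

def p(x, y):
    # p(x -> y) = (c(x:y) + alpha) / (N + |Gamma| * alpha), equation (2)
    return (counts[(x, y)] + ALPHA) / (N + len(gamma) * ALPHA)

def weight(x, y):
    # negative log probability, i.e. the weight carried by the EM transducer
    return -math.log(p(x, y))

print(p("_", "d"))   # probability of inserting a d
print(p("e", "é"))   # probability of restoring an accent on e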
4.2 Calculating corrections

For the language model, LM, we have chosen to rely mainly on simple unigram frequencies for the words in the FreeLing lexicon, where the frequencies were collected from a Wikipedia dump. Such a model is particularly simple to produce a transducer from but, again, there is no principled reason why more complex models could not be used.

Assuming the existence of transducers labeled $word, representing the current word to be corrected, an error model $EM, and a language model $LM, as described above, the correction can be described as a function in the Kleene programming language as follows:

$^correct($word) {
    return $^shortestPath(
        $^lowerside($word _o_ $EM _o_ $LM)
    );
}

This reflects exactly the calculation given in (1), and the entire function is equivalent to finding argmax_w P(s|w) P(w) in the noisy channel model presented above. That is, we compose the input word with the error model and the language model, extract the output projection, and find the least-cost (most probable) path.

When this strategy is employed, we also need to establish a cutoff probability to decide when to simply accept a word as is. Obviously, the error model allows any word to change into any other word, albeit with a low probability, and we do not want to allow extreme changes just to produce a word in the dictionary, but would rather accept the unknown word without change.

Apart from calculating corrections based on a language model and an error model, we can also incorporate previously "known" corrections collected from a gold-standard set of corrected words, if such a resource is available. If we assume access to a set of already known word-word pairs that represent corrected input words, we can build another transducer that directly maps such cases to their correct forms at a low cost. This can be helpful because, for example, abbreviations such as tqm → te quiero mucho would likely incur a high cost under the character-to-character noisy channel corrector because of the large number of character changes required to produce the target form. However, this can be addressed by building a transducer EX that models such previously known word-word pairs and assigns these corrections a low cost, so that the entire calculation can be performed as:

    W = s ◦ ((EM ◦ LM) ∪ EX)                             (3)

This then models both a lookup from a "known" list of cheap corrections and the noisy-channel error-model and language-model corrections.
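Procedurally, the effect of equations (1) and (3) can be pictured with a brute-force search in place of weighted-transducer composition and shortest paths: for every dictionary candidate, add its best character-alignment cost under the error model to its language-model cost, consult a table of known corrections at a fixed low cost, and apply a cutoff so that extreme corrections are rejected. The Python sketch below does this with invented toy weights; all names, costs, and the cutoff value are illustrative, and the actual system performs the computation with weighted transducers in Kleene.

import math

# toy error model: -log p(x -> y); '_' is epsilon. In practice these
# weights would come from the smoothed estimates of equation (2).
def em_cost(x, y):
    if x == y:
        return 0.1            # keeping a character is cheap
    return 3.0                # any substitution, insertion or deletion

# toy unigram language model: -log P(w) for a few dictionary words
LM_COST = {"casa": 2.0, "cansado": 6.0, "pasado": 5.0, "te quiero mucho": 8.0}

# previously known corrections (the EX transducer), at a fixed low cost
KNOWN = {"tqm": "te quiero mucho", "tkm": "te quiero mucho"}
KNOWN_COST = 1.0

CUTOFF = 15.0                 # reject corrections costlier than this

def alignment_cost(s, w):
    # minimum cost of rewriting s into w under the error model
    # (standard edit-distance dynamic programming with EM weights)
    n, m = len(s), len(w)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + em_cost(s[i - 1], "_")        # deletion
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + em_cost("_", w[j - 1])        # insertion
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + em_cost(s[i - 1], w[j - 1]),  # keep/substitute
                d[i - 1][j] + em_cost(s[i - 1], "_"),           # delete
                d[i][j - 1] + em_cost("_", w[j - 1]),           # insert
            )
    return d[n][m]

def correct(s):
    candidates = []
    if s in KNOWN:                          # the EX branch of equation (3)
        candidates.append((KNOWN_COST, KNOWN[s]))
    for w, lm_cost in LM_COST.items():      # the EM o LM branch
        candidates.append((alignment_cost(s, w) + lm_cost, w))
    cost, best = min(candidates)
    return best if cost <= CUTOFF else s    # cutoff: accept the word as is

print(correct("cansao"))   # -> casa with these toy weights
print(correct("tqm"))      # -> te quiero mucho, via the known corrections

With these toy weights the sketch also reproduces the cansao → casa behaviour discussed in Section 6: the low unigram cost of casa outweighs the two deletions needed to reach it.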
5 Evaluation

To evaluate the performance of the methods, we divided the gold standard corpus into five parts in order to perform cross-validation. This is necessary in particular for the noisy channel model, where the learning data must necessarily come from a subset of the gold corpus of normalized tweets.

The results are given in Tables 1 and 2. As can be seen, the manual correction rules slightly outperform the corrections produced by the weighted transducer method.

    System           Accuracy
    Rules            63.1
    Noisy Channel    61.4

Table 1: Results from the development set.

    System           Accuracy
    Rules            60.4

Table 2: Results from the test set.

6 Discussion & Future Work

We have compared and evaluated two different approaches to tweet normalization, one being more of an expert system developed on the basis of observation of tweets and the other a statistical one based on inducing correction rules from a gold standard corpus. As it stands, the noisy channel corrector suffers from the difficulty of constructing a reliable error model from the limited number of gold-standard corrections and from the fact that the unigram language model is relatively poor. This sometimes leads to drastically incorrect conclusions. For example, the corrector would produce from the input cansao the output casa and not the desired cansado, since in a unigram context the high likelihood of casa outweighs the cost of having to delete two characters. Such errors are frequent and could probably be addressed both by a richer language model and by a more refined error model.

Apart from obvious improvements to the language model, there is also potential for combining the hand-written and data-driven approaches to yield a hybrid system.

One perhaps profitable avenue for further experimentation is to take the contextual rules developed by the linguist and learn probabilities for those rules. For example, under the current noisy channel model, the perturbation ∅ → u (insert u) is learned without reference to context and is deemed equally likely regardless of context. However, the hand-written rule allows that transformation primarily following a q, reflecting errors such as *qiero and *aqi.

Quick improvements also appear possible by investing in a good dictionary of named entities, as tweets appear to contain proper names with a much higher frequency than running text in general. For the current work, entities such as movie names, companies, celebrities, etc. were not used in the development. Automatic induction of such a database from large numbers of tweets should also be considered.

References

Beesley, Kenneth R. 2009. The Kleene language for weighted finite-state programming. In Finite-State Methods and Natural Language Processing: Post-proceedings of the 7th International Workshop FSMNLP, edited by Jakub Piskorski, Bruce Watson and Anssi Yli-Jyrä, volume 191, page 27. IOS Press.

Beesley, Kenneth R. and Lauri Karttunen. 2003. Finite State Morphology. CSLI Publications, Stanford, CA.

Carreras, Xavier, Isaac Chao, Lluís Padró, and Muntsa Padró. 2004. FreeLing: An open-source suite of language analyzers. In LREC Proceedings.

Ferguson, T. S. 1973. A Bayesian analysis of some nonparametric problems. Ann. Stat., 1:209–230.

Hulden, Mans. 2009. Foma: a finite-state compiler and library. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 29–32. Association for Computational Linguistics.