Exploiting web-based collective knowledge for micropost normalisation

Uso del conocimiento colectivo recogido en recursos de la Web para la normalización de textos cortos publicados en Twitter

Óscar Muñoz-García
Havas Media Group, Madrid, Spain
oscar.munoz@havasmg.com

Silvia Vázquez, Nuria Bel
Universitat Pompeu Fabra, Barcelona, Spain
silvia.vazquez@upf.edu, nuria.bel@upf.edu

Resumen: La tarea de normalización de contenido publicado por el usuario es un paso fundamental previo al análisis de las publicaciones en los medios sociales, especialmente en Twitter. En este trabajo se presenta un método para la normalización morfológica de tweets mediante el uso de recursos publicados en la Web y desarrollados de manera colectiva, entre los que se encuentran la Wikipedia y un diccionario de SMS. Los resultados obtenidos demuestran que estos recursos son una fuente de conocimiento muy valiosa para la generación de los diccionarios utilizados en la tarea de normalización.
Palabras clave: medios sociales, normalización de contenidos, Twitter, tweet-norm

Abstract: The task of normalising user-generated content is a crucial step before analysing social media posts, particularly on Twitter. This paper presents a method for the morphological normalisation of tweets through the use of online, collectively developed resources, including Wikipedia and an SMS lexicon. The results obtained demonstrate that these resources are a valuable source of knowledge for generating the dictionaries used in the normalisation task.
Keywords: social media, micropost normalisation, Twitter, tweet-norm

1 Introduction and objectives

Microposts published on social media are characterised by informality, brevity, frequent grammatical errors and misspellings, and by the use of abbreviations, acronyms, and emoticons. These features add difficulties to text mining processes, which frequently make use of tools designed for dealing with texts that conform to the canons of standard grammar and spelling (Hovy et al., 2013).

The micropost normalisation task enhances the accuracy of NLP tools when applied to short fragments of text published in social media; e.g., the syntactic normalisation of tweets may improve the accuracy of existing part-of-speech taggers (Codina and Atserias, 2012).

The collective knowledge freely available on the Web, and particularly Wikipedia, has been used in different NLP tasks, such as text categorization (Gabrilovich and Markovitch, 2006), topic identification (Coursey, Mihalcea, and Moen, 2009), measuring the semantic similarity between texts (Gabrilovich and Markovitch, 2007), and word sense disambiguation (Mihalcea, 2007), among others.

This paper presents a technique for the morphological normalisation of microposts through the use of two open data sources, namely Wikipedia and the SMS dictionary of the Spanish Association of Internet Users (AUI, 2013).

The paper is structured as follows. Section 2 describes the architecture and the components of the system. Section 3 describes the linguistic resources that we have reused for constructing the normalisation tool. Section 4 presents the evaluation results. Finally, Section 5 presents the conclusions and future lines of work.

2 Architecture and components of the system

Figure 1 shows the process followed by the proposed micropost normaliser. The specific components involved in the overall process are described below.

[Figure 1: Normalisation process]

2.1 Tokeniser

This component receives the text to be normalised and breaks it into words, Twitter metalanguage elements (e.g., hash-tags, user IDs), emoticons, URLs, etc. The output (i.e., the list of tokens) is sent to the Token Classifier component.

2.2 Token Classifier

The input of this component is the list of tokens generated by the Tokeniser. It classifies each of them into one of the following categories (a classification sketch follows the list):

• Twitter metalanguage elements (i.e., hash-tags, user IDs, RTs and URLs). Such elements are detected by matching regular expressions against the token (e.g., if a token starts with the symbol "#", then it is a hash-tag). Each token classified in this category is sent to the Twitter Metalanguage Normaliser component.

• Words contained in a standard language dictionary, excluding proper nouns. Each token classified in this category is sent to the Normalised Forms Concatenator component.

• Out-Of-Vocabulary (OOV) words, i.e., words that are neither found in a standard dictionary nor Twitter metalanguage elements. Each token classified in this category is sent to the OOV Word Classifier component.
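For illustration, the following Python sketch shows how such a regular-expression-based classification could look. The patterns, function names, and standalone design are our own assumptions for exposition; the actual system performs this classification with Freeling's tokenisation rules and POS-tagging dictionary (see Section 3).

import re

# Illustrative patterns for Twitter metalanguage elements
# (assumed for this sketch, not the system's actual expressions).
METALANGUAGE_PATTERNS = {
    'hashtag': re.compile(r'^#\w+$'),
    'user_id': re.compile(r'^@\w+$'),
    'retweet': re.compile(r'^RT$'),
    'url': re.compile(r'^https?://\S+$'),
}

def classify_token(token, standard_dictionary):
    """Classify a token as metalanguage, in-vocabulary, or OOV."""
    for category, pattern in METALANGUAGE_PATTERNS.items():
        if pattern.match(token):
            # Routed to the Twitter Metalanguage Normaliser.
            return category
    if token.lower() in standard_dictionary:
        # Routed to the Normalised Forms Concatenator.
        return 'in_vocabulary'
    # Routed to the OOV Word Classifier.
    return 'oov'

For example, classify_token('#sepln2013', vocabulary) returns 'hashtag', while an unknown form such as 'loooool' falls through to 'oov'.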
2.3 OOV Word Classifier

This component receives every token previously classified as OOV by the Token Classifier and detects whether it is correct, wrong, or unknown. If the token is wrong, the component returns its correct form. The OOV Word Classifier executes the following process (a lookup sketch for step 1 follows the list):

1. Firstly, the token is looked up in a dictionary of correct OOV words. The search disregards both case and accents.

(a) If an exact match of the token is found in the dictionary (e.g., both forms are capitalised), then the token is classified as Correct and sent to the Normalised Forms Concatenator component with no variation.

(b) If the token is found with variations of case or accentuation, then it is classified as Variation and its correct form is sent to the Normalised Forms Concatenator component.

(c) If the token is not found in the dictionary, then the process continues in step 2.

2. The token is looked up in an SMS dictionary which contains tuples pairing each SMS term with its corresponding correct form. The search is case-insensitive and does not consider accent marks.

(a) If the token is found in the SMS dictionary, then it is classified as Variation and its correct form is retrieved and sent to the Normalised Forms Concatenator component.

(b) If the token is not found in the dictionary, then it is sent to the Spell Checker and Corrector component.
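The case- and accent-insensitive lookup of step 1 can be sketched as follows. This is a minimal illustration under assumed names (fold, lookup_oov, and an in-memory dictionary); the actual system queries an HBase store of Wikipedia titles (see Section 3).

import unicodedata

def fold(form):
    # Lower-case the form and strip diacritics (combining marks),
    # so that e.g. 'México' and 'mexico' map to the same key.
    decomposed = unicodedata.normalize('NFD', form.lower())
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def lookup_oov(token, correct_oov_forms):
    """Return (classification, normalised form) for an OOV token.

    correct_oov_forms maps folded keys to canonical surface forms,
    e.g. fold('México') -> 'México'."""
    canonical = correct_oov_forms.get(fold(token))
    if canonical is None:
        return ('not_found', None)    # continue with the SMS dictionary
    if canonical == token:
        return ('correct', token)     # exact match: no variation
    return ('variation', canonical)   # case/accent variation corrected

With a dictionary entry mapping fold('México') to 'México', the token 'mexico' is classified as a variation and corrected to 'México'.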
2.4 Spell Checker and Corrector

This component checks the spelling of the received token and returns its correct form when possible. To do so, it executes the following process (a sketch of the repetition-reduction step follows the list):

1. Firstly, the token is matched against regular expressions to find whether it contains characters (or sequences of characters) repeated more than twice (e.g., "loooooollll" and "jajaja").

(a) If the token contains repeated characters (or sequences of characters), the repetitions are removed (e.g., yielding "lol" and "ja"), and the resulting form is sent back to the OOV Word Classifier, since the new form may be included in the correct words set.

(b) If the token does not contain repeated characters (or sequences of characters), then the process continues in step 2.

2. The token is sent to an existing spell checking and correction implementation reused by this component.

(a) If the spelling is correct, the token is classified as Correct and sent to the Normalised Forms Concatenator component without a variation.

(b) If the spelling is not correct, the token is classified as Variation, and the first correct form returned by the spelling corrector is sent to the Normalised Forms Concatenator.

(c) If the spell checker is not able to propose a correct form, the token is classified as Unknown and sent to the Normalised Forms Concatenator without a variation.
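The repetition-reduction step can be approximated with two regular expressions, as in the sketch below; the exact patterns used by the component are not specified in the paper, so these are assumptions.

import re

# A single character repeated more than twice ('looool'), and a short
# sequence repeated more than twice ('jajaja'); both patterns assumed.
REPEATED_CHAR = re.compile(r'(.)\1{2,}')
REPEATED_SEQ = re.compile(r'(..+?)\1{2,}')

def collapse_repetitions(token):
    """Reduce runs such as 'loooooollll' -> 'lol' and 'jajaja' -> 'ja'."""
    collapsed = REPEATED_CHAR.sub(r'\1', token)
    collapsed = REPEATED_SEQ.sub(r'\1', collapsed)
    return collapsed

The collapsed form is then re-submitted to the OOV Word Classifier, as described in step 1(a) above.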
2.5 Twitter Metalanguage Normaliser

This component performs a syntactic normalisation of Twitter metalanguage elements. Specifically, it executes a set of rules previously proposed by Kaufmann and Kalita (2010): (1) Remove the sequence of characters "RT" followed by a mention of a Twitter user (marked by the symbol "@") and, optionally, by a colon punctuation mark; (2) Remove user IDs that are not preceded by a coordinating or subordinating conjunction, a preposition, or a verb; (3) Remove the word "via" followed by a user mention at the end of the tweet; (4) Remove all the hash-tags found at the end of the tweet; (5) Remove the "#" symbol from the hash-tags that are maintained; (6) Remove all the hyper-links contained within the tweet; (7) Remove ellipsis points that are at the end of the tweet, followed by a hyper-link; (8) Replace underscores with blank spaces; (9) Divide camel-cased words into multiple words (e.g., "BarackObama" is converted to "Barack Obama"). A sketch approximating some of these rules follows.
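The following sketch approximates a subset of these rules (1, 5, 6, 8 and 9) with regular expressions of our own; it is an illustration, not the implementation referenced above.

import re

def normalise_metalanguage(tweet):
    """Approximate rules 1, 5, 6, 8 and 9 of the normaliser."""
    # (1) Remove 'RT' followed by a user mention and an optional colon.
    tweet = re.sub(r'\bRT\s+@\w+:?\s*', '', tweet)
    # (6) Remove hyper-links.
    tweet = re.sub(r'https?://\S+', '', tweet)
    # (5) Keep the hash-tag word but drop the '#' symbol.
    tweet = tweet.replace('#', '')
    # (8) Replace underscores with blank spaces.
    tweet = tweet.replace('_', ' ')
    # (9) Split camel-cased words, e.g. 'BarackObama' -> 'Barack Obama'.
    tweet = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', tweet)
    return tweet.strip()

Applied to 'RT @user: Check #BarackObama http://t.co/xyz', the sketch yields 'Check Barack Obama'.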
2.6 Normalised Forms Concatenator

This component receives the normalised form of each token and amends the micropost accordingly.

3 Resources employed

The system described makes use of the following resources.

We use Freeling (Padró and Stanilovsky, 2012) for micropost tokenisation. Its specific tokenisation rules and its user map module were adapted for dealing with smileys and with particular elements typically used on Twitter, such as hash-tags, RTs, and user IDs.

In addition, we use the POS-tagging module of Freeling within the Token Classifier component. As we deactivate Freeling's probability-assignment and unknown-word-guesser modules, all the words which are not contained in Freeling's POS-tagging dictionary are left untagged and considered OOV words. Our standard vocabulary is, thus, the Freeling dictionary itself.

We have populated the correct OOV words dictionary (used by the OOV Word Classifier component) by making use of the list of article titles from Wikipedia (Wikipedia, 2013). To speed up the process of querying the 2,447,932 Wikipedia article titles, we uploaded them to an HBase store (Apache, 2013).

In order to increase the coverage of the correct OOV words dictionary, we incorporated into it a list of first names from the Spanish National Institute of Statistics (INE, 2013). This list contains 18,679 male names and 19,817 female names.

Additionally, we have populated the SMS dictionary, with its corresponding correct forms, from the SMS dictionary of the Spanish Association of Internet Users (AUI, 2013), which contains 53,281 entries for Spanish.

Finally, the Spell Checker and Corrector component makes use of Jazzy (Jazzy, 2013), an open-source Java library. For the creation of the spell checker dictionary used by Jazzy, we made use of the Spanish and Mexican dictionaries available on JazzyDicts (JazzyDicts, 2013). The resulting dictionary contains 683,436 terms.

4 Settings and evaluation

The evaluation of the technique previously described was done by using two development corpora and a test corpus provided by the organisation of the Tweet Normalisation Workshop at SEPLN 2013. Specifically, we evaluated the performance of the OOV identification, classification and correction tasks. The accuracy of the normalisation task for the Twitter metalanguage elements was not evaluated, since it was out of the scope of the workshop challenge.

Table 1 shows the results of the evaluation, including the size of each evaluation corpus (column 2), the precision obtained by using either Wikipedia or the SMS dictionary separately (columns 3 and 4, respectively), and the overall precision achieved by exploiting both dictionaries (column 5).

Corpus     Size   Wikipedia   SMS     Both
Devel. 1   100    0.336       0.631   0.688
Devel. 2   500    0.317       0.634   0.660
Test       600    0.361       0.516   0.548

Table 1: Precision of the normalisation tool

As Table 1 reflects, both dictionaries help to improve the final precision score, with the SMS dictionary being the one that contributes the most. This can be explained by the coverage of OOV words by each of the dictionaries, shown in Table 2: the SMS dictionary covers a bigger percentage of the OOV words than the dictionary populated with Wikipedia titles.

Corpus          Wikipedia   SMS
Development 1   20.661%     47.107%
Development 2   20.436%     51.188%
Test            27.497%     28.115%

Table 2: Coverage of OOV words by dictionary

5 Conclusions and future work

We have presented a method for tweet normalisation that relies on existing, collectively developed web resources, finding that such resources, useful for many NLP tasks, are also valid for the task of micropost normalisation.

With respect to future lines of work, we plan to adapt the normaliser to new languages by incorporating the corresponding dictionaries, and to improve the existing lexicons by using additional available resources, such as the anchor texts from intra-wiki links.

Additionally, we plan to improve the normalisation of multiword expressions, in which different words should be transformed into just one (e.g., "a cerca de" should be transformed into "acerca de"), as well as cases where joined words should be split (e.g., "realmadrid"), by using existing word-breaking techniques, such as the one described in (Wang, Thrasher, and Hsu, 2011).

Finally, we will study how the normalisation process affects different opinion mining tasks, including sentiment analysis and topic identification.

Acknowledgements

This research is partially supported by the Spanish Centre for the Development of Industrial Technology under the CENIT program, project CEN-20101037, "Social Media" (http://www.cenitsocialmedia.es). We are very grateful to AUI (Asociación de Usuarios de Internet) for making the textese dictionary used in this work available to us.

References

Apache. 2013. HBase. http://hbase.apache.org. [Online; accessed 25-July-2013].

AUI. 2013. Asociación de Usuarios de Internet. http://aui.es. [Online; accessed 24-July-2013].

Codina, Joan and Jordi Atserias. 2012. What is the text of a tweet? In Proceedings of @NLP can u tag #user generated content?! via lrec-conf.org, Istanbul, Turkey, May. ELRA.

Coursey, K., R. Mihalcea, and W. Moen. 2009. Using encyclopedic knowledge for automatic topic identification. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 210–218. Association for Computational Linguistics.

Gabrilovich, E. and S. Markovitch. 2006. Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In Proceedings of the 21st National Conference on Artificial Intelligence, volume 2, page 1301. AAAI Press / MIT Press.

Gabrilovich, E. and S. Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 6–12.

Hovy, Eduard, Vita Markman, Craig Martell, and David Uthus. 2013. Analyzing microtext. In Papers from the 2013 AAAI Spring Symposium. Association for the Advancement of Artificial Intelligence, March.

INE. 2013. INEbase: Operaciones estadísticas: clasificación por temas. http://www.ine.es/inebmenu/indice.htm. [Online; accessed 8-April-2013].

Jazzy. 2013. Jazzy. http://jazzy.sourceforge.net. [Online; accessed 25-July-2013].

JazzyDicts. 2013. JazzyDicts. http://sourceforge.net/projects/jazzydicts. [Online; accessed 25-July-2013].

Kaufmann, Max and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In Proceedings of the International Conference on Natural Language Processing (ICON-2010).

Mihalcea, R. 2007. Using Wikipedia for automatic word sense disambiguation. In Proceedings of NAACL HLT 2007.

Padró, Lluís and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, May. ELRA.

Wang, Kuansan, Christopher Thrasher, and Paul Bo-June Hsu. 2011. Web scale NLP: A case study on URL word breaking. In Proceedings of the 20th International Conference on World Wide Web, pages 357–366. ACM.

Wikipedia. 2013. Wikipedia:Database download. http://en.wikipedia.org/wiki/Wikipedia:Database_download. [Online; accessed 23-May-2013].