An Analysis of Twitter Corpora and the Differences between Formal and Colloquial Tweets∗
Análisis de Varios Corpus de Twitter y las Diferencias entre Tweets Formales y Coloquiales

Meritxell Gonzàlez
Oxford University Press, Oxford, United Kingdom
Universitat Politècnica de Catalunya, Barcelona, Spain
meritxell.gonzalezbermudez@oup.com

∗ This work was partially funded by the TACARDI project (TIN2012-38523-C02) of the Spanish Ministerio de Economía y Competitividad.

Abstract: This work reviews recent publications addressing the Twitter translation task, and highlights the lack of appropriate corpora that represent the colloquial language used in Twitter. It also discusses the best-known issues in the Twitter genre: the use of hashtags and the number of out-of-vocabulary (OOV) words, with a special focus on comparing the differences between formal and colloquial texts.

Keywords: corpus, tweets, hashtags, OOV

1 Introduction

The success and increasing popularity of microblogging has raised the need to analyse and process its content. Traditional natural language processing methods fail when applied to these texts, and the reasons are neither few nor simple. Roughly speaking, microblog documents do not follow the traditional structure of a formal text or document; they use a number of language variants, styles and registers, among other linguistic phenomena; and they can even include multimedia content as a means of communication (Jehl, 2010; Gotti et al., 2014; Kaufmann and Kalita, 2010; Bertoldi et al., 2010).

Machine Translation (MT) is a hard task within the natural language processing field. It has received considerable attention over the last decades, and it is still an active field with many research challenges. As in other natural language processing tasks, its difficulties include the ambiguity of language and the need for corpora and a gold standard. The former can be addressed by analysing the context in which a sentence occurs, while the latter has typically been addressed by combining large amounts of general-purpose data with smaller domain-specific datasets. The creation of a gold standard in MT requires parallel data that helps to assess the quality of the output.

When addressing automatic translation within the microblogging genre, one has to deal with the additional difficulty of having little or no context and with the fact that microblogs exhibit fleeting domains. Twitter is no different from other microblogs and has, in addition, its own particularities. As described in (Jehl, 2010), tweets share the spontaneity and expressiveness of spoken language, but are limited to 140 characters. Due to this constraint, tweets usually have a very simple syntax. However, they are riddled with ungrammaticalities, misspellings and an unlimited number of lexical variants created out of human imagination and the common ground shared by part of the audience.

In this document, Section 2 summarises recent studies in this field and the different approaches followed to address these phenomena. Next, Sections 3 to 5 give a numerical analysis of six different corpora of tweets written in Basque, Catalan, and Spanish.
The goal of this analysis is to sketch the content of Twitter messages (tweets), highlight their principal characteristics and discuss the differences between formal and colloquial tweets.

2 Recent Work on Twitter Translation

The automatic translation of tweets is, in general, more difficult than regular MT. Although the MT community has already addressed the translation of tweets, there are still few works in this area, mainly because of the lack of corpora, and especially of corpora showing a fair representation of colloquial texts. The number of authors publishing content in multiple languages is not small, but their messages tend to be correct and well structured, in contrast to those posted by the bulk of users.

2.1 Twitter Corpora

The availability of parallel corpora for Twitter is growing but still scarce. The following four works gathered parallel data following diverse approaches, but all of them contain formal texts only. (Gotti et al., 2013) gathered data from Canadian Government agencies, written in French and English. This work describes an MT system that uses in-domain parallel data crawled from the links appearing in the tweets. Hence, tuning was conducted with documents from the same domain. The corpus built in (Ling et al., 2014) contains tweets written in Chinese and English. This work describes a tool and a methodology to help users identify parallel excerpts in the messages and annotate their boundaries. The data obtained with this method was fairly cheap (crowd-sourced) and turned out to be of high quality. (Jehl et al., 2012) used a corpus of Arabic sentences that were manually translated into English. The data was crawled by filtering on the topic (Arab Spring) and was cleaned and pruned, also by means of crowd-sourcing. Finally, the shared task described in (Alegria et al., 2015) distributed a collection of parallel corpora in the languages spoken in the Iberian peninsula. These corpora have been used in this study and are detailed in Section 3.

In contrast, the following four works deal with the noisy input from colloquial texts, but either they do not belong to the Twitter genre or they do not contain parallel data. (Kaufmann and Kalita, 2010) describes an MT system able to translate from colloquial English into standard English. The rationale is that traditional NLP techniques can be applied to standardised text. Their methodology includes the use of aligned data from a corpus of SMS messages that contains the most common acronyms and short forms. (Bertoldi et al., 2010) and (Formiga and Fonollosa, 2012) address the problem of translating noisy input: the former by trying to simulate and generate noisy input automatically, the latter by adding a preprocessing layer that converts the input into clean text. Finally, the corpus described in (Alegria et al., 2014) was distributed to the participants of the TweetNorm shared task. This is a monolingual corpus of Spanish tweets. Since this corpus has been used in this study, it is further detailed in Section 3.

2.2 Linguistic Phenomena

Although the previous works addressed different problems, they share common ground on the principal difficulties of the Twitter genre. First, the translation of hashtags is an open issue that includes their segmentation, their identification and the analysis of their role in sentences (Gotti et al., 2014). Second, the correct tokenisation of the text is essential but difficult due to the extreme noisiness of the text. Also, making the translation fit in 140 characters can harm the quality of the output, although (Jehl, 2010) addressed this issue in her thesis and reported good results.

The increasing interest in the field has promoted the design of tools to create specialised corpora. However, the human translation of tweets also raises open questions (Šubert and Bojar, 2014): for instance, how to translate idioms and slang, out-of-vocabulary words, onomatopoeias, emphasis (jajaaaaa), or irony; but also how to approach the translation of hashtags and symbols (such as emoticons), how to interpret wrong syntax, how to find the translated version of a link, and how to fit the final translation into 140 characters, among others.

All in all, the creation of a synthetic corpus to simulate these phenomena seems a feasible approach (Bertoldi et al., 2010), yet it is out of the scope of this study. Last, but not least, an appropriate methodology and measures to assess the quality of Twitter translations, taking into account the particular characteristics of the genre, have not been addressed so far.

3 Description of the Corpora Used

The next sections analyse six datasets of tweets from the Tweet-Norm (Alegria et al., 2014), Tweet-MT (Alegria et al., 2015) and Social Media (Saurí, 2013) corpora. The goal is to discuss a few of the phenomena mentioned in the previous section.

A set of four datasets was obtained from the Tweet-MT corpora. It consists of two bitexts, for the Catalan–Spanish and Basque–Spanish language pairs. The four datasets contain both the development and the test sets for each language: CAES.ca, CAES.es, EUES.eu and EUES.es. The tweets in these datasets were obtained from a sample of manually selected accounts of authors that tend to tweet in various languages, namely public organisations and personalities. Hence, the content of the messages is mainly formal, i.e., they do not contain misspellings and do not overuse symbols.

The fifth dataset, TNORM, was obtained from the Tweet-Norm corpus, which gathered a random selection of geolocated tweets within the Iberian peninsula, excluding multilingual areas where languages other than Spanish are spoken. The corpus was processed to identify and annotate out-of-vocabulary words. Hence, it contains not only correct messages, but also colloquial ones. The dataset used in this work contains the two development sets and the test set provided in the workshop.

The last dataset used in this work is TSM. It is a portion of the Social Media Corpus, in particular the corpus of tweets in Spanish. It contains a general-domain set of randomly selected tweets. So, similarly to TNORM, it contains both formal and colloquial tweets. They were manually processed to classify them according to the language of the tweet and to annotate different layers such as communication function, polarity, target, and topic. This process included some clean-up of the Twitter mark-up for privacy reasons. Hence, the author id and user mentions, hashtags and URLs were substituted with the labels @USER, #HASHTAG and [URL], respectively.

The six datasets were processed to have similar characteristics: the tokens that correspond to the author id and RT (re-tweet) were removed when present, and the tweets were tokenised using an adaptation to Spanish and Catalan of the Twokenize tool (O'Connor et al., 2010).

                      CAES.ca   CAES.es   EUES.eu   EUES.es   TNORM     TSM
# tweets              4,000     4,000     4,000     4,000     1,132     8,571
# tokens              66,559    66,113    58,368    51,782    14,497    123,679
avg. tokens/tweet     16.39     16.53     14.59     12.94     12.80     14.43

Table 1: Statistics on number of tweets and tokens in each corpus.

Table 1 shows the number of tweets, the number of tokens and the average number of tokens per tweet in each corpus. Regardless of the differences in the nature of the datasets and in their size, they show a similar number of tokens per tweet, with CAES.ca being the dataset with the longest tweets and EUES.eu the one with the shortest. The messages in the two colloquial corpora, TNORM and TSM, seem to be slightly shorter than their formal counterparts in the same language, CAES.es and EUES.es.

Although tweets are similar in length, a deeper analysis of their content shows notable differences between the formal and the colloquial corpora. This section analyses the use of user mentions and URLs, whereas Section 4 analyses the use of hashtags. Although dealing with user mentions (@user) and links is not a big issue, they are discussed here to highlight how they are used in Twitter. Table 2 gives the figures for the use of @user and URLs in the body of the messages. The use of @user does not seem to follow any pattern. The two bitexts of the TweetMT datasets behave in opposite ways: the EUES datasets contain more than twice as many @user mentions as the CAES ones, and almost three times the proportion of @user with respect to the number of tokens. Similarly, the TNORM dataset shows a higher use of @user than the TSM one. It is worth noting that not all @user tokens have a counterpart in the translated text, even though this token does not need to be translated.

                       CAES.ca   CAES.es   EUES.eu   EUES.es   TNORM     TSM
# @users               743       873       1,947     2,070     665       3,439
avg. @users/tweet      0.18      0.22      0.49      0.52      0.59      0.40
% @users wrt. tokens   1.13%     1.32%     3.76%     3.55%     4.59%     2.78%
# URLs                 3,511     3,525     3,461     3,458     69        743
avg. URLs/tweet        0.88      0.88      0.86      0.86      0.06      0.09
% URLs wrt. tokens     5.36%     5.33%     6.68%     5.92%     0.47%     0.60%

Table 2: Statistics on user mentions (@users) and URLs use in each corpus.

In contrast, the use of URLs seems to be consistent within each of the two types of datasets. The four datasets in the bitexts contain almost the same number of URLs, with almost one URL in each tweet. In turn, TNORM and TSM contain a remarkably small number of URLs, fewer than 0.1 per tweet. As a curiosity, the majority of URLs in the bitexts link to documents in the same language as the tweet. Given that the selected authors post multilingual messages, it seems reasonable that they also link to the right URL when available.

4 On the Importance of the Hashtag Occurrences

This section analyses the use of hashtags in the datasets. This study and the next one, in Section 5, follow the procedure in (Gotti et al., 2014), which proved very clear and appropriate to this end. Table 3 shows some statistics on the occurrences of hashtags.

                         CAES.ca   CAES.es   EUES.eu   EUES.es   TNORM     TSM
# hashtags               3,286     3,821     4,828     4,608     182       1,046
# hashtag types          198       430       584       438       157       1
avg. hashtags/tweet      0.82      0.96      1.21      1.52      0.16      0.12
% hashtags wrt. tokens   5.01%     5.78%     8.27%     8.90%     1.26%     0.85%
# tweets > 1 hashtag     1,520     1,750     2,358     2,364     103       744

Table 3: Statistics on hashtag use in each dataset.
Note: The number of hashtag types in TSM is 1 because the corpus contains only the #HASHTAG label.

The difference in the number of hashtags between formal and colloquial datasets is noticeable. The former contain around one hashtag per tweet or more, whereas the latter contain a remarkably low number of them. This seems to indicate that formal tweets tend to use hashtags to categorise their topic and, maybe, to create a trend. This is also reflected in Figure 1: most of the formal tweets, in the bitexts, contain one or two hashtags, whereas most of the colloquial ones have none.

[Figure 1: bar chart with one bar per dataset (CAES.ca, CAES.es, EUES.eu, EUES.es, TNORM, TSM); x-axis from 0% to 100%; legend n = 0, 1, 2, 3.]

Figure 1: % tweets with exactly n hashtags, for n ∈ [0, 1, 2, 3].

A more interesting issue is the translation of hashtags. In terms of the number of occurrences, each side of the bitexts contains a similar amount. However, the number of hashtag types in CAES.ca is much lower than in CAES.es. A side-by-side review of the hashtag sets reveals that the Spanish versions contain more written variants than their counterparts in Catalan. For instance, the hashtag “#revistapremsa” (Catalan) has four variants in the Spanish text: “#revista”, “#revistadeprensa”, “#revistaprensa”, and “#revistaprensa”.

According to (Gotti et al., 2014), hashtags can be classified by the role they play in the text. They distinguish between hashtags that appear at the beginning of the text (prologue), in the text (inline) and at the end of the text (epilogue). Correctly identifying this role is important since a number of hashtags may have a syntactic function inside the text (inline), or can help to identify the domain of the text (prologue and epilogue). A simple heuristic was used to split the tweets into these three parts, and the results are in line with the mentioned study.

                             CAES.ca   CAES.es   EUES.eu   EUES.es   TNORM     TSM
% tweets with a prologue     2.85%     3.42%     10.28%    10.90%    2.03%     2.39%
% tweets with an epilogue    43.6%     49.48%    55.23%    55.13%    5.74%     3.83%
% of # in a prologue         3.50%     3.61%     9.13%     10.63%    17.03%    20.08%
% of # in an epilogue        75.72%    73.46%    57.27%    60.11%    40.66%    35.09%

Table 4: Statistics on hashtag (#) use as prologues and epilogues in each dataset.

We can observe, in Table 4, how the hashtag role within the text varies in each corpus. Although in different proportions, the bulk of hashtags in the formal datasets appear in the epilogue, which indicates that there is a common practice of adding hashtags at the end of the tweet. In contrast, the colloquial datasets have a very small proportion of tweets with either a prologue or an epilogue, but a higher proportion of their hashtags appear in prologues (in comparison to the formal tweets). This behaviour may simply indicate that colloquial tweets do not necessarily follow any common practice. All datasets actually exhibit a low rate of tweets having a prologue, although the EUES bitext shows a remarkably higher number in comparison to the rest. Finally, it is worth noting that, although the number of hashtags is lower in the colloquial texts, roughly half of them appear inline, and hence they play a syntactic role in the message. This is important since they may contain an essential part of the semantics and are thus worth dealing with. Unfortunately, hashtags consist mainly of out-of-vocabulary words, as discussed next in Section 5.

5 On the OOV Words in Twitter

The use of out-of-vocabulary (OOV) words in Twitter has been claimed to be a hard issue. The reason is not only the high number of misspellings, symbols and orthographic errors, which could be partially tackled by using spell-checkers, but also the use of specific lexica and lexical variants: for instance, the use of word combinations (e.g., in hashtags), the combination of different languages (especially in multilingual regions, but also English terms) and the unlimited ability of the microblogging sphere to invent new terms.

This section gives a numerical analysis of the OOVs that occur in Twitter. In order to conduct this analysis, the datasets were processed to remove the user mentions and URLs, since all of them are tokens that do not need to be translated. Some variants of the datasets were built. First, only the CAES bitext was used, due to the lack of a Language Model (LM) for Basque. Then, since the TNORM annotations provide the corrected forms for some OOV tokens (only spelling variants), they were used to build a new dataset, TNORM-S, where OOVs were substituted with the correct word when available. In addition, two different versions were created out of each dataset. In the first one (clean data), the hashtags were kept (the # symbol was removed), since they play an important role in the text, carry part of the semantics of the message and need to be translated in most cases. In the second one (no hashtags), all the hashtags were removed. The purpose of this second version is to highlight the impact of hashtags on the perplexity estimation of the texts.

                       CAES.ca   CAES.es   TNORM     TNORM-S   TSM
% OOV - clean data     5.61%     5.14%     14.23%    12.45%    9.18%
% OOV - no hashtags    2.81%     2.20%     13.53%    11.79%    8.38%
ppl - clean data       603       644       1,325     1,211     1,370
ppl - no hashtags      520       543       1,300     1,192     1,373

Table 5: Count of OOVs and perplexity (ppl) estimation in each corpus using an LM trained on the “El Periódico” corpus. (This parallel corpus is listed in the ELRA catalogue.)

Table 5 shows the results of this analysis. As expected, colloquial datasets contain a higher number of OOVs. TNORM-S contains a slightly lower number of them in comparison to the non-normalised version, which indicates that the use of spell-checkers and the substitution of lexical variants is not enough to deal with OOVs. This is reflected in the figures on the perplexity of the datasets. The perplexity is high across all the datasets, and it decreases only slightly after removing the hashtags from the data, indicating that the language used in the text is notably different from that of the LM. This can be ascribed to the fact that the LM was built using an out-of-domain corpus. In turn, removing the hashtags from the data decreases the amount of OOVs, and seems to have an impact only in the formal datasets, where half of the OOVs occur in the hashtags. However, their proportion is smaller when compared with the colloquial datasets.

For the sake of comparison, the same calculation was carried out using an LM trained on the TNORM corpus, the only one of the two colloquial corpora that is publicly available. The new LM was used to obtain the % of OOVs and perplexity estimations on the CAES.es and TSM datasets. The results are shown in Table 6. The % of OOVs is higher in both cases, most probably due to the small size of the corpus. However, the perplexity of the TSM dataset has decreased. This seems to indicate that the LM was able to capture a high proportion of the particular characteristics of colloquial tweets, and that these may be recurrent in the colloquial genre and do not appear in formal texts.

                       TSM       CAES.es
% OOV - clean data     11.08%    11.26%
% OOV - no hashtags    10.30%    7.51%
ppl - clean data       591       735
ppl - no hashtags      591       669

Table 6: Count of OOVs and the perplexity (ppl) in the TSM and CAES.es corpora using an LM trained on the TNORM corpus.

6 Conclusions and Further Work

Twitter has its own particularities that make it a hard genre to deal with. This work reviews recent publications that address the problem of Twitter translation. The number of works in this field is still scarce due to the lack of corpora, but also because of the lack of a gold standard and of specific evaluation methodologies that can help to assess the quality of a tweet translation. This work also discusses the best-known issues in the Twitter genre: the use of hashtags and the number of OOVs, with a special focus on comparing the differences between formal and colloquial texts. The results obtained are preliminary, but they clearly show that these two registers are different not only from a linguistic point of view, but also in terms of tweet structure and content. Further work has to be done to align the hashtags and the OOVs in bitext corpora and to analyse the way they are translated. Also, the annotation layers of the TSM corpus make it possible to refine the study, for instance, by focusing on the differences between tweets with different communication functions. To conclude, no major differences were found between languages, but this may be ascribed to the fact that the datasets were obtained from bitext corpora.

References

Alegria, Iñaki, Nora Aranberri, Cristina España-Bonet, Pablo Gamallo, Hugo G. Oliveira, Eva Martínez, Iñaki San Vicente, Antonio Toral, and Arkaitz Zubiaga. 2015. Overview of TweetMT: A Shared Task on Machine Translation of Tweets at SEPLN 2015. In Proceedings of the Tweet Translation Workshop co-located with the 31st Conference of the Spanish Society for Natural Language Processing, Alacant, Spain, September.

Alegria, Iñaki, Nora Aranberri, Pere R. Comas, Víctor Fresno, Pablo Gamallo, Lluís Padró, Iñaki San Vicente, Jordi Turmo, and Arkaitz Zubiaga. 2014. TweetNorm_es Corpus: an Annotated Corpus for Spanish Microtext Normalization. In Proceedings of the Ninth International Conference on Language Resources and Evaluation.

Bertoldi, Nicola, Mauro Cettolo, and Marcello Federico. 2010. Statistical Machine Translation of Texts with Misspelled Words. In Proceedings of the 2010 Annual Conference of the North American Chapter of the ACL, pages 412–419. ACL.

Formiga, Lluís and José A. R. Fonollosa. 2012. Dealing with Input Noise in Statistical Machine Translation. In Proceedings of COLING 2012: Posters, pages 319–328, Mumbai, India, December.

Gotti, Fabrizio, Philippe Langlais, and Atefeh Farzindar. 2013. Translating Government Agencies’ Tweet Feeds: Specificities, Problems and (a few) Solutions. In Proceedings of the Workshop on Language Analysis in Social Media, pages 80–89, Atlanta, Georgia, June. ACL.

Gotti, Fabrizio, Philippe Langlais, and Atefeh Farzindar. 2014. Hashtag Occurrences, Layout and Translation: A Corpus-driven Analysis of Tweets Published by the Canadian Government. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, May. ELRA.

Jehl, Laura. 2010. Machine Translation for Twitter. Master’s thesis, University of Edinburgh, United Kingdom.

Jehl, Laura, Felix Hieber, and Stefan Riezler. 2012. Twitter Translation Using Translation-based Cross-lingual Retrieval. In Proceedings of the Seventh Workshop on Statistical Machine Translation, WMT ’12, pages 410–421, Stroudsburg, PA, USA. ACL.

Kaufmann, Max and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In Proceedings of the International Conference on Natural Language Processing, Kharagpur, India.

Ling, Wang, Luis Marujo, Chris Dyer, Alan W Black, and Isabel Trancoso. 2014. Crowdsourcing High-Quality Parallel Data Extraction from Twitter. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 426–436, Baltimore, Maryland, USA, June. ACL.

O’Connor, Brendan, Michel Krieger, and David Ahn. 2010. TweetMotif: Exploratory Search and Topic Summarization for Twitter. In Proceedings of the International Conference on Web and Social Media (ICWSM). The AAAI Press.

Saurí, Roser. 2013. Corpus de Dominio Genérico y Específicos (Inglés, Español, Catalán y Portugués). Technical report, Social Media. Métodos y Tecnologías para los medios sociales. Programa CENIT 2010 (CEN-20101037).

Šubert, Eduard and Ondřej Bojar. 2014. Twitter crowd translation – design and objectives. In Translating and the Computer 36, pages 217–227, Geneva, Switzerland. AsLing, The International Association for Advancement in Language Technology, Editions Tradulex; AsLing.
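
As a supplementary illustration of the hashtag-role analysis in Section 4, the following Python sketch shows one possible way to split a tweet into prologue, inline and epilogue hashtags and to derive percentages of the kind reported in Table 4. The paper only states that "a simple heuristic" was used, so everything here is an assumption rather than the author's procedure: the prologue is taken to be the leading run of hashtag tokens, the epilogue the trailing run of hashtag or URL tokens, and the function names (`split_hashtag_roles`, `role_statistics`) and the sample tweet are hypothetical.

```python
from typing import Dict, List


def is_hashtag(token: str) -> bool:
    """A token counts as a hashtag if it starts with '#' and has a body."""
    return len(token) > 1 and token.startswith("#")


def is_url(token: str) -> bool:
    """Assumed URL test: raw links or the [URL] label used in the TSM corpus."""
    return token == "[URL]" or token.startswith(("http://", "https://"))


def split_hashtag_roles(tweet: str) -> Dict[str, List[str]]:
    """Assign every hashtag of a whitespace-tokenised tweet to one role."""
    tokens = tweet.split()
    # Prologue (assumption): maximal run of hashtags at the start of the tweet.
    start = 0
    while start < len(tokens) and is_hashtag(tokens[start]):
        start += 1
    # Epilogue (assumption): maximal trailing run of hashtags and URLs.
    end = len(tokens)
    while end > start and (is_hashtag(tokens[end - 1]) or is_url(tokens[end - 1])):
        end -= 1
    return {
        "prologue": tokens[:start],
        "inline": [t for t in tokens[start:end] if is_hashtag(t)],
        "epilogue": [t for t in tokens[end:] if is_hashtag(t)],
    }


def role_statistics(tweets: List[str]) -> Dict[str, float]:
    """Compute the four percentages used as rows in Table 4."""
    splits = [split_hashtag_roles(t) for t in tweets]
    total_tags = sum(len(s["prologue"]) + len(s["inline"]) + len(s["epilogue"])
                     for s in splits)

    def pct(part: int, whole: int) -> float:
        return 100.0 * part / whole if whole else 0.0

    return {
        "pct_tweets_with_prologue": pct(sum(1 for s in splits if s["prologue"]), len(tweets)),
        "pct_tweets_with_epilogue": pct(sum(1 for s in splits if s["epilogue"]), len(tweets)),
        "pct_hashtags_in_prologue": pct(sum(len(s["prologue"]) for s in splits), total_tags),
        "pct_hashtags_in_epilogue": pct(sum(len(s["epilogue"]) for s in splits), total_tags),
    }


if __name__ == "__main__":
    # Invented sample tweet, for illustration only.
    sample = "#BCN Nova edició de la #revistapremsa avui http://example.org #cultura #agenda"
    print(split_hashtag_roles(sample))
    print(role_statistics([sample, "hola", "#a #b"]))
```

Treating URLs as transparent at the end of a tweet is a design choice of this sketch, motivated by the observation in Section 3 that formal tweets carry almost one URL each; a stricter heuristic that stops at the first non-hashtag token would report fewer epilogues.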