=Paper=
{{Paper
|id=Vol-1228/paper3
|storemode=property
|title=TweetSafa: Tweet Language Identification
|pdfUrl=https://ceur-ws.org/Vol-1228/tweetlid-3-mendizabal.pdf
|volume=Vol-1228
|dblpUrl=https://dblp.org/rec/conf/sepln/MendizabalCH14
}}
==TweetSafa: Tweet Language Identification==
TweetSafa: Tweet language identification
(TweetSafa: Identificación del lenguaje de tweets)

Iosu Mendizabal, Jeroni Carandell & Daniel Horowitz
(IIIA-CSIC) Artificial Intelligence Research Institute, Spanish Council for Scientific Research; (UPC) Universitat Politècnica de Catalunya; (URV) Universitat Rovira i Virgili; (UB) Universitat de Barcelona
iosu@iiia.csic.es, jeroni.carandell@gmail.com, daniel.horowitzzz@gmail.com

Resumen: Este artículo describe la metodología utilizada en la tarea propuesta en SEPLN 14 para la identificación de lenguaje de tweets (TweetLID), como se explica en (San Vicente et al., 2014). El sistema consta de un preprocesamiento de tweets, creación de diccionarios a partir de N-gramas y dos algoritmos de reconocimiento de lenguaje.
Palabras clave: Reconocimiento de lenguaje, lenguaje de tweets.

Abstract: This paper describes the methodology used for the SEPLN 14 shared task of tweet language identification (TweetLID), as explained in (San Vicente et al., 2014). The system consists of three stages: pre-processing of tweets, creation of a dictionary of n-grams, and two algorithms ultimately used for language identification.
Keywords: Tweet identification, tweet language.

1 Introduction and objectives

Language identification is vital as a preliminary step of any natural language processing application. The increasing use of social networks as information exchange media is turning them into very important information centers. Twitter has become one of the most powerful information exchange mechanisms, and every day millions of users upload enormous numbers of tweets.

The SEPLN 2014 TweetLID task focuses on the automatic identification of the language in which tweets are written, as tweet language identification is attracting increasing interest in the scientific community (Carter, Weerkamp, and Tsagkias, 2013). Identifying the language makes it possible to subsequently apply NLP techniques to the tweet, such as machine translation, sentiment analysis, or information extraction. Accurately identifying the language also makes it possible to select the resources suitable for the language in question.

The scope of this task covers the top five languages of the Iberian Peninsula (Spanish, Portuguese, Catalan, Basque and Galician) as well as English. These languages are likely to co-occur in many news items and events relevant to the Iberian Peninsula, so accurate identification of the language is key to ensuring that the appropriate resources are used for the linguistic processing.

The rest of the article is laid out as follows. Section 2 introduces the architecture and components of the system: the pre-processing stage, where tweets are adapted for better comprehension by our algorithms, and the algorithms themselves. Section 3 then describes our results for the given problem. Finally, in Section 4 we draw conclusions and propose future work.

2 Architecture and components of the system

We present two different approaches to the problems posed in track one (constrained) and track two (unconstrained). Both methods share a great part of the process, in terms of the set of tweets used for learning as well as the way incoming tweets are pre-processed and learned.

2.1 Pre-processing

The first step of this process is to identify the noise present in all tweets, regardless of the language. There are issues common to regular text, such as multiple space characters, but also Twitter-specific tokens like user name tags or emoticons. After identifying these issues, we are able to remove them, mostly using regular expressions. The main issues found in the tweet domain, and our approach to each, are the following:
• Different case characters: all characters were lowercased so that case would not interfere in the identification process, since the same character with different cases is treated as two different elements.

• Numbers and emoticons: since these kinds of characters appear equally in any language, they were removed.

• Vowel repetitions: vowel repetition is a common issue when dealing with chatspeak. These repetitions could damage the algorithm's performance, so they were reduced to a maximum of two occurrences using regular expressions.

• Multiple spaces: also a common issue when dealing with tweets. A regular expression collapses runs of spaces into a single space character.

When working with N-grams it is important to note that not all special characters should be removed from the text, since some of them can help the identification process. Characters like the apostrophe are more likely to appear in English and Catalan than in other languages, so this kind of special character must not be treated as noise, and we keep them for a better result.

2.2 N-Gram distribution

To classify tweets into languages using N-grams, we have to extract a meaningful distribution for each language. To do so, we created documents of concatenated tweets for each language: English, Spanish, Catalan, Portuguese, Galician, Basque, other and undetermined. Tweets with mixed labels such as 'en+es', as well as those with ambiguous labels such as 'en/es', are added to both languages they contain (in this case to both Spanish 'es' and English 'en'). We then extract N-gram distributions in a dynamic way, so that we can choose the maximum N we wish.

2.3 Algorithms

Once we have an N-gram distribution for each language, given a new tweet to classify we find the most probable language by extracting the tweet's N-gram distribution and comparing it with the languages' distributions. To do so, we took two different approaches.

2.3.1 Linear interpolation

The first method tries to find the probability of a sentence being generated by each language, by multiplying the probabilities of the consecutive N-grams of the sentence under the respective language. The problem appears when we deal with a small finite dataset: there are not enough instances to reliably estimate the probabilities; in other words, the sparse data problem appears. This means that if the corpus of a certain language does not contain a certain N-gram, a sentence containing that N-gram would automatically receive a probability of zero.

To avoid this problem when computing the probability of each tweet under the N-gram distribution of each language, we use the linear interpolation smoothing method, also known as Jelinek-Mercer smoothing (Jelinek, 1997; Huang, Acero, and Hon, 2001). To use this smoothing method we first have to process our N-gram corpus, the one generated from the 14991 training tweets, to calculate the λ values.
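To make the smoothing concrete, here is a minimal sketch of a Jelinek-Mercer interpolated estimate over character n-grams. All function and variable names are our own illustration, not the authors' code; the λ vector is assumed to come from the deleted-interpolation step (Brants, 2000).

```python
from collections import Counter

def ngram_counts(text, max_n):
    """Count character k-grams for k = 1..max_n."""
    counts = {k: Counter() for k in range(1, max_n + 1)}
    for k in range(1, max_n + 1):
        for i in range(len(text) - k + 1):
            counts[k][tuple(text[i:i + k])] += 1
    return counts

def interpolated_prob(ngram, counts, lambdas):
    """Jelinek-Mercer estimate of P(t_n | t_1..t_{n-1}), equation (1):
    a lambda-weighted sum of maximum-likelihood estimates of orders 1..n."""
    n = len(ngram)
    total_unigrams = sum(counts[1].values())
    prob = 0.0
    for i in range(1, n + 1):
        sub = ngram[n - i:]                      # last i characters
        if i == 1:
            ml = counts[1][sub] / total_unigrams if total_unigrams else 0.0
        else:
            hist = sub[:-1]                      # (i-1)-character history
            denom = counts[i - 1][hist]
            ml = counts[i][sub] / denom if denom else 0.0
        prob += lambdas[i - 1] * ml
    return prob
```

A tweet would then be scored per language by summing the log of `interpolated_prob` over its sliding n-gram windows, the language with the highest score winning.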
We use a dynamic program to compute as many λ values as the N-gram orders extracted from the training set. For instance, if we consider N-gram distributions up to 5 for English, we compute five λ values, one λi for each i-gram with i ∈ {1, ..., 5}. The probability of an N-gram is then computed as follows:

P(tn | t1, ..., tn−1) = λ1·P̂(tn) + λ2·P̂(tn | tn−1) + ... + λn·P̂(tn | t1, ..., tn−1)    (1)

for any n, where the P̂ are maximum likelihood estimates of the probabilities and λ1 + λ2 + ... + λn = 1, so that P again represents a probability distribution.

The values of λ are computed by deleted interpolation (Brants, 2000). This technique successively removes each max-gram (the biggest n-gram) from the training corpus and estimates the best values for the λ's from all other n-grams in the corpus, adding confidence to the λ of the proportionally most frequently seen N-gram. The procedure is given in Algorithm 1, where N is the size of the training corpus:

    set λ1 = λ2 = ... = λn = 0
    foreach max-gram (t1, ..., tn) with count(t1, ..., tn) > 0 do
        depending on the maximum of the following values:
            case (count(tn) − 1) / (N − 1):
                increment λ1 by count(t1, ..., tn)
            case (count(tn−1, tn) − 1) / (count(tn−1) − 1):
                increment λ2 by count(t1, ..., tn)
            case (count(tn−2, tn−1, tn) − 1) / (count(tn−2, tn−1) − 1):
                increment λ3 by count(t1, ..., tn)
            ...
            case (count(t1, ..., tn) − 1) / (count(t1, ..., tn−1) − 1):
                increment λn by count(t1, ..., tn)
    end
    Algorithm 1: Deleted interpolation algorithm.

2.3.2 Out-of-place measure

For this next method, for every n we only consider a ranking list of n-grams ordered from most to least frequent, where only the order is preserved, as opposed to the exact frequencies. We do this because, when comparing a single tweet (a document of at most 140 characters) against the distribution of each language, we cannot expect the frequency distribution of the tweet's n-grams to resemble the one in the concatenated language document. We can, however, say that the most frequent n-grams have a higher probability of appearing, though not necessarily with frequencies proportional to those in the document. For this reason we use the out-of-place measure.

We submitted this method as unconstrained because two of the parameters we used, discussed below, were extracted from previous work of ours on a self-downloaded corpus of tweets in different languages. We did this because finding new values would have taken too long, given the huge search space.

This measure is a distance that tells us approximately how far a tweet is from a language, for a fixed n-gram order. Given the tweet ranking {Ti} and a language n-gram ranking {Lj}, the distance is the sum of the number of positions by which each element of T is displaced in the language list: we sum |i − j| for every Ti in the tweet that is equal to some Lj. If an element of {Ti} does not exist in {Lj}, we suppose the best case, i.e. that the missing element sits at the bottom of the list; as we discuss in Section 4, this might not have been such a good idea. Finally, to be able to compare different distances, we normalize the out-of-place measure:

outOfPlaceMeasure / (length({Ti}) · length({Lj}))    (2)

Figure 1 shows the out-of-place measure calculated for a tweet from an English dataset.

Figure 1: Example of an out-of-place measure.

The m and n parameters bound the number of elements we allow in each list, so that computation time is not compromised by an unnecessary search over all the n-grams of a language (Cavnar, Trenkle, and others, 1994). These are the two parameters, mentioned above, that make the method unconstrained; in our previous project we found the values m=80 and n=50 to work best.

To avoid possible divisions by zero in equation 2 (tweets are sometimes empty or nearly empty, especially after the cleaning of HTML, punctuation, etc.), we suppose that if the number of characters is smaller than three, the tweet is undetermined. Again, this is a bold assumption that needs to be refined in future work.

Finally, in the training process we reward each n-gram order whenever it correctly guesses a tweet. If, for example, the trigrams label a tweet correctly but the unigrams and bigrams do not, the trigrams receive one point and the others get none. We do this for all the tweets in the training set, and in the end we obtain a reliability score for each n-gram order.
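The normalized out-of-place distance of equation (2) can be sketched as follows. The optimistic handling of missing n-grams follows the description above; which of the caps m and n applies to which list is our assumption, and all names are illustrative.

```python
def out_of_place(tweet_rank, lang_rank, m=80, n=50):
    """Normalized out-of-place distance (equation 2) between a tweet's
    n-gram ranking and a language's n-gram ranking, most frequent first."""
    tweet_rank = tweet_rank[:n]    # cap the list sizes; the paper found
    lang_rank = lang_rank[:m]      # m=80, n=50 (which cap goes where is assumed)
    if not tweet_rank or not lang_rank:
        return float("inf")        # near-empty tweets are treated as undetermined
    position = {gram: j for j, gram in enumerate(lang_rank)}
    bottom = len(lang_rank) - 1
    distance = 0
    for i, gram in enumerate(tweet_rank):
        # a missing n-gram is optimistically assumed to sit at the bottom
        distance += abs(i - position.get(gram, bottom))
    return distance / (len(tweet_rank) * len(lang_rank))
```

The language with the smallest distance to the tweet's ranking would then be chosen.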
When a test is done on a tweet, a weighted vote is taken using these confidence parameters, so that the language with the most votes, weighted by reliability, wins.

3 Setup and evaluation

The official results of our approach are the following. In the constrained category, using the linear interpolation algorithm (Section 2.3.1), we obtained a precision of 0.777, a recall of 0.719 and an F-measure of 0.736. In the unconstrained category we used the out-of-place measure algorithm (Section 2.3.2) and obtained a precision of 0.598, a recall of 0.625 and an F-measure of 0.578.

3.1 Empirical settings

Before submitting the final results, we made runs with different maximum N-gram orders to find the one with the best results. Also, because of the ambiguity of tweets with more than one language (for instance es+en), we take the average of the probabilities of all the languages and derive a threshold. For the linear interpolation we used:

Threshold = (maxProbability − Average) / α    (3)

where maxProbability is the maximum of the language probabilities and α > 0 is a restriction value that tolerates more or fewer languages being suggested.
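The threshold of equation (3) can be sketched as follows. The decision rule, suggesting every language whose probability falls within the threshold of the best one, is our reading of this section, and the function name is illustrative.

```python
def suggest_languages(probs, alpha=10.0):
    """Apply the ambiguity threshold of equation (3): languages whose
    probability comes within (maxProbability - Average) / alpha of the
    best one are all suggested (alpha = 10 was used for the final runs)."""
    best = max(probs.values())
    average = sum(probs.values()) / len(probs)
    threshold = (best - average) / alpha
    return sorted(lang for lang, p in probs.items() if p >= best - threshold)
```

For a tweet scored close to equally in Spanish and English, this rule would emit both labels (es+en), mirroring the mixed-label tweets in the training data.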
The bigger the α, the less tolerance to ambiguity in the predicted languages for each tweet and the more precise the result; the smaller the α, the higher the recall but the lower the precision.

For the ranking-based method, the threshold is chosen by running a search from 0 to 0.3 in intervals of 0.05. The optimum found on the dataset is 0.05.

3.2 Empirical evaluation

We ran experiments with N-gram orders from 1 to 8, and we set the α value to 10, which gave the best results on the validation set.

Figure 2 shows the results of the experiments made with the linear interpolation method. The results improve as the N-gram order grows, but performance peaks at the 5-gram; from there on, the results get slightly worse with each additional order.

Figure 2: Results obtained on the training set with the linear interpolation method.

Because of these results, we decided to submit the 5-gram results for the SEPLN 2014 test set.

In the case of the ranking-based method, we do not have to test different n-gram combinations, since we obtain a reliability value for each n-gram order: if a certain order were systematically wrong, it would receive a very low confidence and thus little influence. For computational reasons, we finally decided to use only 6 n-gram orders.

4 Conclusions and future work

In this paper we have described our approach to the SEPLN 2014 shared task of tweet language identification (TweetLID). Our system is based on a pre-processing stage, which takes into account the different accents that can appear in different languages by keeping them in the N-gram distribution stage rather than erasing them, and on two different algorithms: linear interpolation smoothing and the out-of-place measure. These algorithms obtain F-measures of 0.736 and 0.578, respectively, on the given test corpus of 19993 tweets. Our system ranked 3rd among the participants of the constrained track, using the linear interpolation algorithm, and 6th in the unconstrained track, using the out-of-place measure.

Among the mistakes we made was to underestimate the role of numerical digits, which we removed. In English, numbers are often used to shorten text, so removing them loses a great part of the words; for example, "to forgive someone" might be written as "2 4give som1". The same holds for many emerging internet alphabets, such as Arabizi (the Arabic chat alphabet).
As future work for the ranking-based method, it might be interesting to consider the distribution of word lengths in each language, since it can be a very discriminative characteristic. Also, in this method the out-of-place measure should have penalized more severely the n-grams that do not appear in the document list, instead of supposing they could be found at the last element of the list. Finally, we must stress the importance of the pre-processing of tweets as one of the key parts of the project.

References

Brants, Thorsten. 2000. TnT: A statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing, ANLC '00, pages 224–231, Stroudsburg, PA, USA. Association for Computational Linguistics.

Carter, Simon, Wouter Weerkamp, and Manos Tsagkias. 2013. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 47(1):195–215, March.

Cavnar, William B., John M. Trenkle, et al. 1994. N-gram-based text categorization. Ann Arbor MI, 48113(2):161–175.

Huang, Xuedong, Alex Acero, and Hsiao-Wuen Hon. 2001. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition.

Jelinek, Frederick. 1997. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, USA.

San Vicente, Iñaki, Arkaitz Zubiaga, Pablo Gamallo, José Ramom Pichel, Iñaki Alegria, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno. 2014. Overview of TweetLID: Tweet language identification at SEPLN 2014. In TweetLID @ SEPLN 2014.