=Paper=
{{Paper
|id=Vol-1228/paper5
|storemode=property
|title=Tweets Language Identification using Feature Weighting
|pdfUrl=https://ceur-ws.org/Vol-1228/tweetlid-5-diaz-zamora.pdf
|volume=Vol-1228
|dblpUrl=https://dblp.org/rec/conf/sepln/ZamoraBB14
}}
==Tweets Language Identification using Feature Weighting==
Tweets Language Identification using Feature Weighting (Identificación de idioma en tweets mediante pesado de términos)

Juglar Díaz Zamora, Universidad de Oriente, Santiago de Cuba, Cuba, juglar.diaz@cerpamid.co.cu
Adrian Fonseca Bruzón and Reynier Ortega Bueno, CERPAMID, Santiago de Cuba, Cuba, {adrian, reynier.ortega}@cerpamid.co.cu

Resumen: This paper describes a language identification method presented at the Twitter Language Identification Workshop (TweetLID-2014). The proposed method represents tweets by means of character trigrams weighted according to their relevance for each language. To weight the trigrams, three feature weighting schemes traditionally used for dimensionality reduction in Text Categorization were employed. The language of each tweet is obtained by simple majority after summing the weights that each trigram present in the tweet contributes to each language. Finally, we analyze the results obtained.

Abstract: This paper describes the language identification method presented at the Twitter Language Identification Workshop (TweetLID-2014). The proposed method represents tweets by weighted character-level trigrams. We employed three different weighting schemes used in Text Categorization to obtain a numerical value that represents the relation between trigrams and languages. For each language, we add up the importance of each trigram. Afterward, the tweet language is determined by simple majority voting. Finally, we analyze the results.

Keywords: tweets, language identification, feature weighting

===1 Introduction===
With the growing interest in social networks like Twitter and Facebook, the research community has focused on applying data mining techniques to such sources of information. One of these sources is the messages produced in the social network Twitter, known as tweets. Tweets represent a challenge for traditional Text Mining techniques mainly due to two characteristics: the length of the texts (only 140 characters are allowed) and the Internet slang present in them. Because of the 140-character limit, people create their own informal linguistic style by shortening words and using acronyms.

Language identification (LID) is the task of identifying the language in which a text is written. It is an important pre-processing step for traditional Text Mining techniques; also, Natural Language Processing tasks like machine translation, part-of-speech tagging and parsing are language dependent.

Some work has been done on LID for long and well-formed texts; traditional approaches are focused on words and n-grams (of characters). Among word-based features, we can find short-word and frequency-based approaches. Grefenstette (1995) proposed a short-word approach that uses words of up to five characters that occur at least three times. The idea behind this approach is the language-specific significance of common words like conjunctions, which mostly have only marginal lengths. In the frequency-based approach, Souter et al. (1994) take into account the one hundred most frequent words per language, extracted from training data for 9 languages; 91% of all documents were correctly identified.

The n-gram based approach uses n-grams of different (Cavnar and Trenkle, 1994) or fixed (Grefenstette, 1995; Prager, 1999) lengths from tokenized words. Cavnar and Trenkle (1994) evaluate their algorithm on a corpus of 3,713 documents in 14 languages; for language models of more than 300 n-grams, very good results of 99.8% were achieved.

The n-gram technique described by Grefenstette (1995) calculates the frequency of trigrams in a language sample; the probability of a trigram for a language is approximated by dividing the trigram frequency by the sum of all trigram frequencies in that language. The probabilities are then used to guess the language by splitting the test text into trigrams and calculating the probability of the sequence of trigrams for each language, assigning a minimal probability to trigrams without assigned probabilities. The language with the highest probability for the sequence of trigrams is chosen.
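To make this trigram-probability scheme concrete, the following Python sketch estimates per-language trigram probabilities by relative frequency and scores a text by the sum of the log-probabilities of its trigrams, with an assumed small floor probability for unseen trigrams. The function names, the floor value and the toy training samples are illustrative and are not taken from the cited work.

<pre>
# Minimal sketch of trigram-probability language guessing (illustrative only).
from collections import Counter
from math import log

MIN_PROB = 1e-6  # assumed floor probability for trigrams unseen in a language


def trigrams(text):
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(samples_by_language):
    """samples_by_language: dict mapping language -> raw sample text."""
    models = {}
    for lang, sample in samples_by_language.items():
        counts = Counter(trigrams(sample))
        total = sum(counts.values())
        models[lang] = {t: c / total for t, c in counts.items()}
    return models


def guess_language(text, models):
    # Score each language by the log-probability of the trigram sequence.
    scores = {lang: sum(log(probs.get(t, MIN_PROB)) for t in trigrams(text))
              for lang, probs in models.items()}
    return max(scores, key=scores.get)


models = train({"en": "the quick brown fox jumps over the lazy dog",
                "es": "el rapido zorro marron salta sobre el perro perezoso"})
print(guess_language("the dog sleeps", models))  # -> en
</pre>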
We consider that for LID to be effective, the inflected forms of a root word should be related to the same word. Knowing that the character-level n-grams of different morphological variations of a word tend to produce many of the same n-grams, we chose to use character-level n-grams as features in our approach. Since trigrams have given good results in LID (Grefenstette, 1995; Prager, 1999), they are our choice of n-gram.

Some studies have shown that systems designed for other types of texts perform well on tweet language identification (TLID) (Lui and Baldwin, 2012), but some systems that were specifically designed for the characteristics of tweets performed better (Carter et al., 2013). There is also a body of work on TLID employing different techniques, for example graph representations of languages based on trigrams (Tromp and Pechenizkiy, 2011), combination of systems (Carter et al., 2013), and user language profiles, links and hashtags (Carter et al., 2013).

We propose a language identification system based on feature weighting schemes (FWS) commonly used in Text Categorization (TC). We obtain a numerical value that represents the relation between features, trigrams of characters in our case, and languages. This proposal can be extended to words and to longer or shorter n-grams.

The remainder of the paper is structured as follows. In Section 2, we describe our tweet language identification system (Cerpamid-TLID2014) and the feature weighting schemes tested. In Section 3, we present the experiments conducted to estimate the parameters of our system and analyze the effect of the feature weighting schemes on tweet language identification. Finally, we present conclusions and attractive directions for future work.

===2 System description===
In this section, we describe our system and the feature weighting schemes that we used in our experiments.

====2.1 Feature weighting====
Dimensionality reduction (DR) is an important step in Text Categorization. It can be defined as the task of reducing the dimensionality of the traditional vector space representation of documents. There are two main approaches to this task (Sebastiani, 2002):
* Dimensionality reduction by feature selection (John, Kohavi and Pfleger, 1994): the chosen features r' are a subset of the original r features (e.g. words, phrases, stems, lemmas).
* Dimensionality reduction by feature extraction: the chosen features are not a subset of the original r features, but are obtained by combinations or transformations of the original ones.

There are two distinct ways of viewing DR, depending on whether the task is performed locally (i.e., for each individual category) or globally. We focus on local feature selection schemes, since our interest is to obtain the importance of every trigram (feature) for every language (category).

Many local feature selection techniques have been tried. We show in Table 1 those used in this paper: the GSS Coefficient (GSS) (Galavotti, Sebastiani, and Simi, 2000), the NGL Coefficient (NGL) (Ng, Goh, and Low, 1997) and Mutual Information (MI) (Battiti, 1994).

{| class="wikitable"
|-
! FWS !! Mathematical form
|-
| MI || $\log \frac{P(t_k, c_i)}{P(t_k) \cdot P(c_i)}$
|-
| NGL || $\frac{\sqrt{N}\left[P(t_k, c_i) \cdot P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) \cdot P(\bar{t}_k, c_i)\right]}{\sqrt{P(t_k) \cdot P(c_i) \cdot P(\bar{t}_k) \cdot P(\bar{c}_i)}}$
|-
| GSS || $P(t_k, c_i) \cdot P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) \cdot P(\bar{t}_k, c_i)$
|}
Table 1: Feature weighting schemes.

In our case, in order to make the feature weighting schemes depend on the amount of text available for each language, and not on the number of documents, the probabilities in Table 1 are interpreted on an event space of features (Sebastiani, 2002); e.g., $P(\bar{t}_k, c_i)$ denotes the joint probability that the trigram $t_k$ does not occur in the language $c_i$, computed as the ratio between the number of trigrams in $c_i$ different from $t_k$ and the total number of trigrams in the corpus, $N$. For each language, we keep the most important trigrams and discard the rest.
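The Python sketch below shows one possible reading of this event space: the MI, NGL and GSS weights are computed for every (trigram, language) pair directly from trigram occurrence counts. The data layout (a Counter of trigram occurrences per language) and the helper names are assumptions made for illustration, not the authors' implementation.

<pre>
# Sketch of feature weighting over an event space of trigram occurrences
# (illustrative only; probabilities are estimated from occurrence counts).
from collections import Counter, defaultdict
from math import log, sqrt


def weight_trigrams(counts_by_language, scheme="MI"):
    """counts_by_language: dict mapping language -> Counter of trigram counts."""
    N = sum(sum(c.values()) for c in counts_by_language.values())
    lang_totals = {l: sum(c.values()) for l, c in counts_by_language.items()}
    trigram_totals = Counter()
    for c in counts_by_language.values():
        trigram_totals.update(c)

    weights = defaultdict(dict)
    for lang, counts in counts_by_language.items():
        for t in counts:
            p_tc = counts[t] / N                            # P(tk, ci)
            p_t = trigram_totals[t] / N                     # P(tk)
            p_c = lang_totals[lang] / N                     # P(ci)
            p_t_notc = (trigram_totals[t] - counts[t]) / N  # P(tk, ~ci)
            p_nott_c = (lang_totals[lang] - counts[t]) / N  # P(~tk, ci)
            p_nott_notc = 1.0 - p_tc - p_t_notc - p_nott_c  # P(~tk, ~ci)
            if scheme == "MI":
                w = log(p_tc / (p_t * p_c))
            elif scheme == "GSS":
                w = p_tc * p_nott_notc - p_t_notc * p_nott_c
            else:  # NGL
                num = sqrt(N) * (p_tc * p_nott_notc - p_t_notc * p_nott_c)
                den = sqrt(p_t * p_c * (1 - p_t) * (1 - p_c))
                w = num / den
            weights[lang][t] = w
    return weights


# Toy counts; real counts come from the training tweets of each language.
counts_by_language = {
    "en": Counter({"_th": 20, "the": 25, "he_": 22, "ing": 15}),
    "es": Counter({"_de": 24, "de_": 26, "que": 18, "ent": 12}),
}
mi_weights = weight_trigrams(counts_by_language, scheme="MI")
</pre>

Under this reading, N counts trigram occurrences rather than documents, which is what ties the weights to the amount of text available per language; keeping only the highest-weighted trigrams per language then corresponds to the selection step described above.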
====2.2 Language identification====
Our system is a three-step procedure: first, trigrams are extracted from the tweet; then a filtering phase takes place, in which tweets that do not belong to the set of languages that our system identifies are labeled as other; finally, a language is assigned to the tweet. We present these steps in Algorithm 1.

Algorithm 1: Cerpamid-TLID2014
Consider t the tweet whose language is to be identified, c the content of t, and L_j the list of weighted trigrams for language j.

Step 1: Split c into trigrams
a) Split c into words.
b) Remove numbers and punctuation marks and make all the text lowercase.
c) Add an underscore at the beginning and the end of every word.
d) Obtain the list l_t of trigrams that represent t.

Step 2: Filtering
a) Let trigrams_c be the number of trigrams in l_t and trigrams_in the number of trigrams in l_t that appear in any L_j.
b) Let n = trigrams_in / trigrams_c, and let θ be a threshold on the proportion of known trigrams in c.
c) If n > θ, go to Step 3; otherwise set the language to other.

Step 3: Selecting the language
a) For each language L_j, compute $vote(L_j) = \sum_{t_i \in l_t} weight(t_i, L_j)$.
b) Label t with the most voted language.
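The sketch below is a minimal Python rendering of the three steps of Algorithm 1, assuming a simple regex tokenizer, a weights mapping from language to weighted trigrams (for example, the output of one of the feature weighting schemes above) and the label other for filtered-out tweets; the helper names are hypothetical.

<pre>
# Illustrative sketch of Algorithm 1 (not the authors' code).
import re


def tweet_trigrams(content):
    # Step 1: lowercase, drop numbers and punctuation, pad words with '_',
    # then collect character trigrams.
    words = re.findall(r"[^\W\d_]+", content.lower())
    trigrams = []
    for w in words:
        padded = "_" + w + "_"
        trigrams.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return trigrams


def identify_language(content, weights, theta):
    lt = tweet_trigrams(content)
    if not lt:
        return "other"
    known = {t for lang_weights in weights.values() for t in lang_weights}
    # Step 2: filtering by the proportion n of known trigrams.
    n = sum(1 for t in lt if t in known) / len(lt)
    if n <= theta:
        return "other"
    # Step 3: majority voting by summed trigram weights per language.
    votes = {lang: sum(lw.get(t, 0.0) for t in lt)
             for lang, lw in weights.items()}
    return max(votes, key=votes.get)
</pre>

With the threshold θ = 0.9 estimated in Section 3.1, more than 90% of the trigrams of a tweet must already be known for the tweet to be assigned one of the target languages.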
===3 Experiments===
In this section we explain the estimation of the threshold θ (see Section 2.2, Algorithm 1, Step 2), the experiments on the feature weighting schemes, and the results of our proposal at TweetLID-2014.

In order to evaluate the feature weighting schemes and to estimate the filtering threshold θ, the corpus provided by the organizers of TweetLID-2014 was divided into training, development and test sets. The training set is 70% of the tweets not labeled as language undefined or other. The remaining 30% was divided again into 70% and 30%: this 30% is our development set, and this last 70%, together with the tweets labeled undefined or other, forms the test set.

In TweetLID-2014 the organizers proposed two modalities: constrained, where training can only be performed on the training set provided at TweetLID-2014, and free training (unconstrained), where it is possible to use any other resource. For the free training mode, we decided to increase the amount of text per language provided at TweetLID-2014 in order to give our proposal a greater ability to differentiate one language from another. For English (161 MB), Portuguese (174 MB) and Spanish (174 MB) we used texts from the Europarl corpus (Koehn, 2005), and for Catalan (650 MB), Basque (181 MB) and Galician (157 MB), articles from Wikipedia. The training corpus for our experiments in the free training mode is the 70% of the tweets not labeled as language undefined or other plus the documents from Europarl and Wikipedia.

====3.1 Estimation of threshold θ====
To estimate θ, we first obtain the list of weighted trigrams from the training set, even though the weights are not used at this stage. We then obtain a list L of the values n (see Section 2.2, Algorithm 1, Step 2) for all the tweets in the development set. Next, we repeat 10,000 times a sampling with replacement over L, and in each of these iterations we select the lowest value different from zero. These values are averaged, and that average is our threshold θ. The idea is to estimate statistically the value of n for a tweet written in one of the languages that we identify. The value obtained in our experiments was 0.9, and it was used for all runs.
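A small Python sketch of this bootstrap procedure is given below, assuming the development-set values of n are already available as a list; the names and the toy values are illustrative.

<pre>
# Illustrative sketch of the bootstrap estimation of the threshold theta.
import random


def estimate_theta(dev_n_values, iterations=10000, seed=0):
    rng = random.Random(seed)
    minima = []
    for _ in range(iterations):
        resample = [rng.choice(dev_n_values) for _ in dev_n_values]
        nonzero = [v for v in resample if v > 0]
        if nonzero:
            minima.append(min(nonzero))
    return sum(minima) / len(minima)


dev_n_values = [0.0, 0.89, 0.92, 0.94, 0.95, 0.97, 1.0]  # toy values of n
theta = estimate_theta(dev_n_values)
</pre>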
====3.2 Selecting the best feature weighting scheme====
In Table 2 we show the results obtained with each feature weighting scheme on our test set, in the two modalities, constrained and unconstrained. The best FWS in both modes was MI, while for every FWS the constrained version obtained better results for LID.

{| class="wikitable"
|-
! FWS !! Mode !! Precision (averaged)
|-
| MI || Constrained || 0.704
|-
| MI || Free Training || 0.632
|-
| NGL || Constrained || 0.691
|-
| NGL || Free Training || 0.522
|-
| GSS || Constrained || 0.585
|-
| GSS || Free Training || 0.431
|}
Table 2: Results of each feature weighting scheme.

In order to make a deeper analysis of our system, we show in Table 3 the precision and the number of assignments to every language for our best combination of feature weighting scheme and task mode (MI in constrained mode).

{| class="wikitable"
|-
! Language !! #Tweets !! Precision
|-
| English || 218 || 0.784
|-
| Portuguese || 548 || 0.833
|-
| Catalan || 345 || 0.817
|-
| Other || 43 || 0.418
|-
| Basque || 109 || 0.623
|-
| Galician || 117 || 0.572
|-
| Spanish || 1826 || 0.882
|}
Table 3: Results by language using MI in constrained mode.

====3.3 Results at TLID-2014====
For our participation at TLID-2014 we used the full corpus provided by the organizers and, in addition, the documents extracted from Europarl and Wikipedia for the free training mode. In Table 4 and Table 5 we show our results at TweetLID-2014. As can be seen, we placed 8th out of 12 runs and 5th out of 7 groups in the constrained mode (Table 4). Our precision at TweetLID-2014 is similar to the precision we obtained on our own test set, while the F1 measure dropped because of the low values of recall. In the unconstrained version, we placed last with our two runs; almost all teams did worse in this mode.

{| class="wikitable"
|-
! Group !! P !! R !! F1
|-
| UPV (2) || 0.825 || 0.744 || 0.752
|-
| UPV (1) || 0.824 || 0.730 || 0.745
|-
| UB / UPC || 0.777 || 0.719 || 0.736
|-
| Citius (1) || 0.824 || 0.685 || 0.726
|-
| RAE (2) || 0.813 || 0.648 || 0.711
|-
| RAE (1) || 0.818 || 0.645 || 0.710
|-
| Citius (2) || 0.689 || 0.772 || 0.710
|-
| CERPAMID (1) || 0.716 || 0.681 || 0.666
|-
| UDC / LYS (1) || 0.732 || 0.734 || 0.638
|-
| IIT-BHU || 0.605 || 0.670 || 0.615
|-
| CERPAMID (2) || 0.704 || 0.578 || 0.605
|-
| UDC / LYS (2) || 0.610 || 0.582 || 0.498
|}
Table 4: Results at TLID-2014 for constrained mode.

{| class="wikitable"
|-
! Group !! P !! R !! F1
|-
| Citius (1) || 0.802 || 0.748 || 0.753
|-
| UPV (2) || 0.737 || 0.723 || 0.697
|-
| UPV (1) || 0.742 || 0.686 || 0.684
|-
| Citius (2) || 0.696 || 0.659 || 0.655
|-
| UDC / LYS (1) || 0.682 || 0.688 || 0.581
|-
| UB / UPC || 0.598 || 0.625 || 0.578
|-
| UDC / LYS (2) || 0.588 || 0.590 || 0.571
|-
| CERPAMID (1) || 0.694 || 0.461 || 0.506
|-
| CERPAMID (2) || 0.583 || 0.537 || 0.501
|}
Table 5: Results at TLID-2014 for free training mode.

===4 Conclusions and future work===
We presented a tweet language identification system based on character trigrams and feature weighting schemes used for Text Categorization. One of our runs placed 8th out of 12 in the constrained version at TLID-2014, while in the free training version we placed last. Most of the systems performed better in the constrained version. We found that the main weakness of our proposal is the identification of tweets labeled other. As future work, we plan to explore other features, test other feature weighting schemes, and tackle the identification of tweets labeled as other by including lists of common terms used in tweets in the filtering step.

===Bibliography===
* Battiti, R. 1994. Using mutual information for selecting features in supervised neural net learning. Neural Networks, 5(4):537–550.
* Carter, S., W. Weerkamp, and M. Tsagkias. 2013. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, pages 1–21.
* Cavnar, W. and J. Trenkle. 1994. N-gram-based text categorization. In Proceedings of the Third Symposium on Document Analysis and Information Retrieval, Las Vegas, USA.
* Galavotti, L., F. Sebastiani, and M. Simi. 2000. Experiments on the use of feature selection and negative evidence in automated text categorization. In Proceedings of ECDL-00, 4th European Conference on Research and Advanced Technology for Digital Libraries (Lisbon, Portugal, 2000), 59–68.
* Grefenstette, G. 1995. Comparing two language identification schemes. In 3rd International Conference on Statistical Analysis of Textual Data.
* John, G. H., R. Kohavi, and K. Pfleger. 1994. Irrelevant features and the subset selection problem. In Proceedings of ICML-94, 11th International Conference on Machine Learning (New Brunswick, NJ, 1994), 121–129.
* Koehn, P. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.
* Ng, H. T., W. B. Goh, and K. L. Low. 1997. Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of SIGIR-97, 20th ACM International Conference on Research and Development in Information Retrieval (Philadelphia, PA, 1997), 67–73.
* Prager, J. 1999. Linguini: Language identification for multilingual documents. In Proceedings of the 32nd Hawaii International Conference on System Sciences.
* Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47.
* Souter, C., G. Churcher, J. Hayes, J. Hughes, and S. Johnson. 1994. Natural language identification using corpus-based models. Hermes Journal of Linguistics, 13:183–203.
* Tromp, E. and M. Pechenizkiy. 2011. Graph-based n-gram language identification on short texts. In Proceedings of Benelearn 2011, pages 27–35, The Hague, Netherlands.
* Vatanen, T., J. Väyrynen, and S. Virpioja. 2010. Language identification of short text segments with n-gram models. In LREC 2010, pages 3423–3430.