Tweets language identification using feature weighting

Juglar Díaz Zamora
Universidad de Oriente
Santiago de Cuba, Cuba
juglar.diaz@cerpamid.co.cu

Adrian Fonseca Bruzón, Reynier Ortega Bueno
CERPAMID
Santiago de Cuba, Cuba
{adrian, reynier.ortega}@cerpamid.co.cu


      Resumen: This paper describes a language identification method presented at the Twitter Language Identification Workshop (TweetLID-2014). The proposed method represents tweets by means of character trigrams weighted according to their relevance for each language. To weight the trigrams, three feature weighting schemes traditionally used for dimensionality reduction in Text Categorization were employed. The language of each tweet is obtained by simple majority after summing the weights that each trigram present in the tweet contributes to each language. Finally, we analyze the results obtained.
      Palabras clave: tweets, language identification, feature weighting

      Abstract: This paper describes the language identification method presented at the Twitter Language Identification Workshop (TweetLID-2014). The proposed method represents tweets by weighted character-level trigrams. We employed three different weighting schemes used in Text Categorization to obtain a numerical value that represents the relation between trigrams and languages. For each language, we add up the importance of every trigram present in the tweet; afterward, the tweet's language is determined by simple majority voting. Finally, we analyze the results.
      Keywords: tweets, language identification, feature weighting

1   Introduction

With the growing interest in social networks like Twitter and Facebook, the research community has focused on applying data mining techniques to such sources of information. One of these sources is the stream of messages produced in the social network Twitter, known as tweets. Tweets challenge traditional Text Mining techniques mainly due to two characteristics: the length of the texts (only 140 characters allowed) and the Internet slang present in them. Because of the 140-character limit, people create their own informal linguistic style by shortening words and using acronyms.
    Language identification (LID) is the task of identifying the language in which a text is written. It is an important pre-processing step for traditional Text Mining techniques; in addition, Natural Language Processing tasks like machine translation, part-of-speech tagging and parsing are language dependent.
    Some work has been done on LID for long and well-formed texts; traditional approaches focus on words and character n-grams. Among word-based features we find short-word and frequency-based approaches. Grefenstette (1995) proposed a short-word approach that uses words of up to five characters occurring at least three times; the idea behind it is the language-specific significance of common words, such as conjunctions, which mostly have short lengths. In the frequency-based approach, Souter et al. (1994) take into account the one hundred most frequent words per language, extracted from training data for 9 languages; 91% of all documents were correctly identified.
    The n-gram based approach uses n-grams of variable (Cavnar and Trenkle, 1994) or fixed (Grefenstette, 1995; Prager, 1999) length extracted from tokenized words. Cavnar and Trenkle (1994) evaluate their algorithm on a corpus of 3713 documents in 14 languages; for language models of more than 300 n-grams, very good results of 99.8% were achieved.
    The n-gram technique described by Grefenstette (1995) calculates the frequency of trigrams in a language sample. The probability of a trigram for a language is approximated by summing the frequencies of all trigrams for the language and dividing the trigram frequency by that sum. The probabilities are then used to guess the language: the test text is divided into trigrams, the probability of the sequence of trigrams is calculated for each language (assigning a minimal probability to trigrams without assigned probabilities), and the language with the highest probability for the sequence is chosen.
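    To give the flavor of this scheme, the following is a minimal sketch in Python of our reading of the description above (not Grefenstette's implementation; names and the minimal-probability constant are assumed):

import math
from collections import Counter

MIN_PROB = 1e-6  # assumed minimal probability for trigrams never seen in training


def trigrams(text):
    """Character trigrams of a text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(samples):
    """samples: {language: sample text}. For each language, a trigram's
    probability is its frequency divided by the sum of all trigram frequencies."""
    models = {}
    for lang, text in samples.items():
        counts = Counter(trigrams(text))
        total = sum(counts.values())
        models[lang] = {t: n / total for t, n in counts.items()}
    return models


def identify(text, models):
    """Choose the language that assigns the trigram sequence the highest
    probability; log-probabilities are summed to avoid numeric underflow."""
    def score(probs):
        return sum(math.log(probs.get(t, MIN_PROB)) for t in trigrams(text))
    return max(models, key=lambda lang: score(models[lang]))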
    We consider that, for LID to be effective, the inflected forms of a root word should be related to the same word. Since the character-level n-grams of different morphological variants of a word tend to coincide to a large extent, we chose character-level n-grams as features in our approach; and since trigrams have given good results in LID (Grefenstette, 1995; Prager, 1999), we set the n-gram length to three.
    Some studies have shown that systems designed for other types of texts perform well on tweet language identification (TLID) (Lui and Baldwin, 2012), but some systems specifically designed for the characteristics of tweets performed better (Carter et al., 2013).
    There is also a body of work on TLID employing different techniques, for example graph representations of languages based on trigrams (Tromp and Pechenizkiy, 2011), combinations of systems (Carter et al., 2013), and user language profiles, links and hashtags (Carter et al., 2013).
    We propose a language identification system based on feature weighting schemes (FWS) commonly used in Text Categorization (TC): we obtain a numerical value that represents the relation between features (trigrams of characters in our case) and languages. This proposal can be extended to words and to longer or shorter n-grams.
    The remainder of the paper is structured as follows. In Section 2, we describe our tweet language identification system (Cerpamid-TLID2014) and the feature weighting schemes tested. In Section 3, we present the experiments conducted to estimate the parameters of our system and analyze the effect of the feature weighting schemes on tweet language identification. Finally, we present conclusions and attractive directions for future work.

2   System description

In this section, we describe our system and the feature weighting schemes that we used in our experiments.

2.1  Feature weighting

Dimensionality reduction (DR) is an important step in Text Categorization. It can be defined as the task of reducing the dimensionality of the traditional vector space representation of documents. There are two main approaches to this task (Sebastiani, 2002):
• Dimensionality reduction by feature selection (John, Kohavi, and Pfleger, 1994): the chosen features r' are a subset of the original r features (e.g. words, phrases, stems, lemmas).
• Dimensionality reduction by feature extraction: the chosen features are not a subset of the original r features, but are obtained by combinations or transformations of the original ones.
    There are two distinct ways of viewing DR, depending on whether the task is performed locally (i.e., for each individual category) or globally.
    We focus on local feature selection schemes, since our interest is to obtain the importance of every trigram (feature) for every language (category).
    Many local feature selection techniques have been tried. Table 1 shows those used in this paper: the GSS Coefficient (GSS) (Galavotti, Sebastiani, and Simi, 2000), the NGL Coefficient (NGL) (Ng, Goh, and Low, 1997) and Mutual Information (MI) (Battiti, 1994).

    FWS    Mathematical form
    MI     $\log \frac{P(t_k, c_i)}{P(t_k)\,P(c_i)}$
    NGL    $\frac{\sqrt{N}\,\left[P(t_k, c_i)\,P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)\,P(\bar{t}_k, c_i)\right]}{\sqrt{P(t_k)\,P(c_i)\,P(\bar{t}_k)\,P(\bar{c}_i)}}$
    GSS    $P(t_k, c_i)\,P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)\,P(\bar{t}_k, c_i)$

    Table 1: Feature weighting schemes.
    In our case, in order to make the feature weighting schemes depend on the available amount of text for each language rather than on the number of documents, the probabilities in Table 1 are interpreted over an event space of features (Sebastiani, 2002). For example, $P(\bar{t}_k, c_i)$ denotes the joint probability that trigram $t_k$ does not occur and the language is $c_i$, computed as the ratio between the number of trigram occurrences in $c_i$ different from $t_k$ and the total number of trigram occurrences in the corpus, $N$.
    For each language, we keep the most important trigrams and discard the rest.
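    To make this event space concrete, the sketch below (our own code and naming; it assumes raw trigram counts per language are available) computes the three weights of Table 1 from those counts:

import math
from collections import Counter


def trigram_weights(counts, scheme="MI"):
    """counts: {language: Counter of trigram occurrences in that language's text}.
    Returns {language: {trigram: weight}}; every trigram occurrence in the
    corpus is one event, so probabilities are ratios of occurrence counts."""
    N = sum(sum(c.values()) for c in counts.values())  # total occurrences in corpus
    totals = Counter()
    for c in counts.values():
        totals.update(c)                               # occurrences of each trigram
    weights = {}
    for lang, c in counts.items():
        p_c = sum(c.values()) / N                      # P(c_i)
        lang_w = {}
        for t in c:
            p_t = totals[t] / N                        # P(t_k)
            p_tc = c[t] / N                            # P(t_k, c_i)
            p_t_nc = p_t - p_tc                        # P(t_k, not c_i)
            p_nt_c = p_c - p_tc                        # P(not t_k, c_i)
            p_nt_nc = 1.0 - p_t - p_c + p_tc           # P(not t_k, not c_i)
            if scheme == "MI":
                w = math.log(p_tc / (p_t * p_c))
            elif scheme == "GSS":
                w = p_tc * p_nt_nc - p_t_nc * p_nt_c
            elif scheme == "NGL":
                denom = math.sqrt(p_t * p_c * (1.0 - p_t) * (1.0 - p_c))
                w = math.sqrt(N) * (p_tc * p_nt_nc - p_t_nc * p_nt_c) / denom
            else:
                raise ValueError(scheme)
            lang_w[t] = w
        weights[lang] = lang_w
    return weights

Keeping the most important trigrams per language then amounts to retaining the top-weighted entries of each inner dictionary.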
2.2  Language identification

Our system is a three-step procedure: first, trigrams are extracted from the tweet; then a filtering phase takes place, in which tweets that do not belong to the set of languages our system identifies are labeled as other; finally, a language is assigned to the tweet. We present these steps in Algorithm 1.

   Algorithm 1. Cerpamid-TLID2014
Consider t, the tweet whose language is to be identified; c, the content of t; and L_j, the list of weighted trigrams for language j.

Step 1: Split c into trigrams
    a) Split c into words.
    b) Remove numbers and punctuation marks, and lowercase all the text.
    c) Add an underscore at the beginning and the end of every word.
    d) Obtain the list l_t of trigrams that represent t.

Step 2: Filtering
    a) Let trigrams_c be the number of trigrams in l_t, and trigrams_in the number of trigrams in l_t that appear in any L_j.
    b) Let $n = \frac{trigrams\_in}{trigrams\_c}$, and let θ be a threshold on the proportion of known trigrams in c.
    c) If n > θ, go to Step 3; else set the language to other.

Step 3: Selecting the language
    a) For each language L_j:
       $vote(L_j) = \sum_{t_i \in l_t} weight(t_i, L_j)$
    b) Label t with the most voted language.
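    The following is a minimal sketch of Algorithm 1 (our own naming; the weighted lists L_j are assumed to have been built beforehand, e.g. with one of the schemes of Table 1):

import re


def tweet_trigrams(content):
    """Step 1: lowercase, drop digits and punctuation, pad each word with
    underscores, and collect the character trigrams of every padded word."""
    words = re.findall(r"[^\W\d_]+", content.lower())  # Unicode letter sequences
    lt = []
    for w in words:
        padded = "_" + w + "_"
        lt += [padded[i:i + 3] for i in range(len(padded) - 2)]
    return lt


def identify(content, lists, theta=0.9):
    """Steps 2 and 3. lists: {language: {trigram: weight}}; theta: the
    filtering threshold on the proportion of known trigrams."""
    lt = tweet_trigrams(content)
    if not lt:
        return "other"
    known = sum(1 for t in lt if any(t in L for L in lists.values()))
    if known / len(lt) <= theta:                       # Step 2: filtering
        return "other"
    votes = {lang: sum(L.get(t, 0.0) for t in lt)      # Step 3: weighted votes
             for lang, L in lists.items()}
    return max(votes, key=votes.get)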
3   Experiments

In this section we explain the estimation of the threshold θ (see Section 2.2, Algorithm 1, Step 2), the experiments on the feature weighting schemes, and the results of our proposal at TweetLID-2014. In order to evaluate the feature weighting schemes and to estimate the filtering threshold θ, the corpus provided by the organizers of TweetLID-2014 was divided into training, development and test sets. The training set consists of 70% of the tweets not labeled as undefined or other. The remaining 30% was divided again into 70% and 30%: this 30% is our development set, and this last 70%, together with the tweets labeled undefined or other, forms the test set.
    At TweetLID-2014 the organizers proposed two modalities: constrained, where training may only be performed on the training set provided at TweetLID-2014, and free training (unconstrained), where it is possible to use any other resource. For the free training mode, we decided to increase the amount of text per language beyond that provided at TweetLID-2014, in order to give our proposal a greater ability to differentiate one language from another. For English (161 MB), Portuguese (174 MB) and Spanish (174 MB) we used texts from the Europarl corpus (Koehn, 2005), and for Catalan (650 MB), Basque (181 MB) and Galician (157 MB), articles from Wikipedia. The training corpus for our experiments in the free training mode is the 70% of the tweets not labeled as undefined or other, plus the documents from Europarl and Wikipedia.

3.1  Estimation of threshold θ

To estimate θ, we first obtain the list of weighted trigrams from the training set, even though the weights are not used at this stage. Then we obtain a list L of the values n (see Section 2.2, Algorithm 1, Step 2) for all the tweets in the development set. Next, we repeat 10000 times a sampling with replacement over L; in each of these iterations we select the lowest value different from zero. These values are averaged, and the average is our threshold θ. The idea is to statistically estimate the value of n for a tweet written in one of the languages that we identify. The value obtained in our experiment was 0.9, and it was used for all runs.
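    This estimation can be sketched as follows (our own code; it assumes the list of proportions n over the development tweets has already been computed):

import random


def estimate_theta(n_values, iterations=10000, seed=None):
    """n_values: the proportions n (Algorithm 1, Step 2) of every development
    tweet. Averages, over many resamples with replacement, the smallest
    nonzero proportion observed in each resample."""
    rng = random.Random(seed)
    minima = []
    for _ in range(iterations):
        sample = rng.choices(n_values, k=len(n_values))
        nonzero = [v for v in sample if v > 0]
        if nonzero:
            minima.append(min(nonzero))
    return sum(minima) / len(minima)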
3.2  Selecting the best feature weighting scheme

In Table 2 we show the results obtained with each feature weighting scheme on our test sets, in the two modalities, constrained and unconstrained. The best FWS in both modes was MI, and for every FWS the constrained version obtained better results.
    In order to make a deeper analysis of our system, we show in Table 3 the precision and the number of assignments to every language for our best combination of feature weighting scheme and task mode (MI in constrained mode).

    FWS    Precision (averaged)    Mode
    MI     0.704                   Constrained
    MI     0.632                   Free training
    NGL    0.691                   Constrained
    NGL    0.522                   Free training
    GSS    0.585                   Constrained
    GSS    0.431                   Free training

    Table 2: Results of each feature weighting scheme.

    Language      #Tweets    Precision
    English       218        0.784
    Portuguese    548        0.833
    Catalan       345        0.817
    Other         43         0.418
    Basque        109        0.623
    Galician      117        0.572
    Spanish       1826       0.882

    Table 3: Results by language using MI in constrained mode.

3.3  Results at TweetLID-2014

For our participation at TweetLID-2014 we used the full corpus provided by the organizers and, in addition, the documents extracted from Europarl and Wikipedia for the free training mode. In Table 4 and Table 5 we show our results at TweetLID-2014. As can be seen, we placed 8th out of 12 runs and 5th out of 7 groups in the constrained mode (Table 4). Our precision at TweetLID-2014 is similar to the precision we obtained on our own test set, while the F1 measure dropped because of low recall values. In the unconstrained version, we placed last with our two runs; almost all teams did worse in this mode.

    Group            P        R        F1
    UPV (2)          0.825    0.744    0.752
    UPV (1)          0.824    0.730    0.745
    UB / UPC         0.777    0.719    0.736
    Citius (1)       0.824    0.685    0.726
    RAE (2)          0.813    0.648    0.711
    RAE (1)          0.818    0.645    0.710
    Citius (2)       0.689    0.772    0.710
    CERPAMID (1)     0.716    0.681    0.666
    UDC / LYS (1)    0.732    0.734    0.638
    IIT-BHU          0.605    0.670    0.615
    CERPAMID (2)     0.704    0.578    0.605
    UDC / LYS (2)    0.610    0.582    0.498

    Table 4: Results at TweetLID-2014 for constrained mode.

    Group            P        R        F1
    Citius (1)       0.802    0.748    0.753
    UPV (2)          0.737    0.723    0.697
    UPV (1)          0.742    0.686    0.684
    Citius (2)       0.696    0.659    0.655
    UDC / LYS (1)    0.682    0.688    0.581
    UB / UPC         0.598    0.625    0.578
    UDC / LYS (2)    0.588    0.590    0.571
    CERPAMID (1)     0.694    0.461    0.506
    CERPAMID (2)     0.583    0.537    0.501

    Table 5: Results at TweetLID-2014 for free training mode.

4   Conclusions and future work

We presented a tweet language identification system based on character trigrams and feature weighting schemes used for Text Categorization. One of our runs placed 8th out of 12 in the constrained version at TweetLID-2014, whilst in the free training version we placed last. Most of the systems performed better in the constrained version. We found that the main weakness of our proposal is the identification of tweets labeled other. As future work, we consider exploring other features, testing other feature weighting schemes, and tackling the identification of tweets labeled as other by including lists of common terms used in tweets in the filtering step.
Bibliography

Battiti, R. 1994. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537–550.

Carter, S., W. Weerkamp, and M. Tsagkias. 2013. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, pages 1–21.

Cavnar, W. and J. Trenkle. 1994. N-gram-based text categorization. In Proceedings of the Third Symposium on Document Analysis and Information Retrieval, Las Vegas, USA.

Galavotti, L., F. Sebastiani, and M. Simi. 2000. Experiments on the use of feature selection and negative evidence in automated text categorization. In Proceedings of ECDL-00, 4th European Conference on Research and Advanced Technology for Digital Libraries (Lisbon, Portugal, 2000), pages 59–68.

Grefenstette, G. 1995. Comparing two language identification schemes. In 3rd International Conference on Statistical Analysis of Textual Data.

John, G. H., R. Kohavi, and K. Pfleger. 1994. Irrelevant features and the subset selection problem. In Proceedings of ICML-94, 11th International Conference on Machine Learning (New Brunswick, NJ, 1994), pages 121–129.

Koehn, P. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.

Ng, H. T., W. B. Goh, and K. L. Low. 1997. Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of SIGIR-97, 20th ACM International Conference on Research and Development in Information Retrieval (Philadelphia, PA, 1997), pages 67–73.

Prager, J. 1999. Linguini: Language identification for multilingual documents. In Proceedings of the 32nd Hawaii International Conference on System Sciences.

Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47.

Souter, C., G. Churcher, J. Hayes, J. Hughes, and S. Johnson. 1994. Natural language identification using corpus-based models. Hermes Journal of Linguistics, 13:183–203.

Tromp, E. and M. Pechenizkiy. 2011. Graph-based n-gram language identification on short texts. In Proceedings of Benelearn 2011, pages 27–35, The Hague, Netherlands.

Vatanen, T., J. Väyrynen, and S. Virpioja. 2010. Language identification of short text segments with n-gram models. In LREC 2010, pages 3423–3430.