=Paper= {{Paper |id=Vol-1228/paper3 |storemode=property |title=TweetSafa: Tweet Language Identification |pdfUrl=https://ceur-ws.org/Vol-1228/tweetlid-3-mendizabal.pdf |volume=Vol-1228 |dblpUrl=https://dblp.org/rec/conf/sepln/MendizabalCH14 }} ==TweetSafa: Tweet Language Identification== https://ceur-ws.org/Vol-1228/tweetlid-3-mendizabal.pdf
                  TweetSafa: Tweet language identification
                  TweetSafa: Identificación del lenguaje de tweets

    Iosu Mendizabal
    Artificial Intelligence Research Institute (IIIA)
    Spanish Council for Scientific Research (CSIC)
    iosu@iiia.csic.es

    Jeroni Carandell & Daniel Horowitz
    Universitat Politècnica de Catalunya (UPC)
    Universitat Rovira i Virgili (URV)
    Universitat de Barcelona (UB)
    jeroni.carandell@gmail.com
    daniel.horowitzzz@gmail.com

    Resumen: Este artículo describe la metodología utilizada en la tarea propuesta en SEPLN 14
    para la identificación del lenguaje de tweets (TweetLID), como se explica en (San Vicente
    et al., 2014). El sistema consta de un preprocesamiento de tweets, creación de diccionarios
    a partir de N-gramas y dos algoritmos de reconocimiento de lenguaje.
    Palabras clave: Reconocimiento de lenguaje, lenguaje de tweets.
    Abstract: This paper describes the methodology used for the SEPLN 14 shared task of
    tweet language identification (TweetLID), as explained in (San Vicente et al., 2014). The
    system consists of three stages: pre-processing of the tweets, creation of a dictionary of
    N-grams, and two algorithms ultimately used for language identification.
    Keywords: Language identification, tweet language.
1    Introduction and objectives

Language identification is vital as a preliminary step of any natural language processing application. The increasing use of social networks as a medium of information exchange is turning them into very important information centers. Twitter has become one of the most powerful information exchange mechanisms, and every day millions of users upload millions of tweets.

The SEPLN 2014 TweetLID task focuses on the automatic identification of the language in which tweets are written, as tweet language identification is attracting increasing interest in the scientific community (Carter, Weerkamp, and Tsagkias, 2013). Identifying the language helps to subsequently apply NLP techniques to the tweet, such as machine translation, sentiment analysis, information extraction, etc. Accurate language identification facilitates the application of resources suited to the language in question.

The scope of this task covers the top five languages of the Iberian Peninsula: Spanish, Portuguese, Catalan, Basque and Galician, as well as English. These languages are likely to co-occur in news and events relevant to the Iberian Peninsula, and thus accurate identification of the language is key to making sure that we use the appropriate resources for the linguistic processing.

The rest of the article is laid out as follows. Section 2 introduces the architecture and components of the system: the pre-processing stage, where tweets are adapted so that our algorithms can handle them better, and the algorithms themselves. Afterwards, section 3 describes our results for the given problem. To conclude, in section 4 we draw some conclusions and propose future work.

2    Architecture and components of the system

We present two different approaches to the problems posed in track one (constrained) and track two (unconstrained). Both methods share a large part of the pipeline, in terms of the set of tweets used for learning as well as the way incoming tweets are pre-processed and learned.

2.1    Pre-processing

The first step of the process is to identify the noise present in all tweets regardless of the language. There are common issues related to regular text, such as multiple space characters, but also Twitter-specific tokens like user name tags or emoticons. After identifying these issues, we are able to remove them, mostly using regular expressions.
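This kind of regex-based clean-up can be sketched as follows. The patterns and the function name are our illustrative assumptions, not the system's actual code (emoticon stripping, for instance, is omitted here); note that apostrophes are deliberately left untouched.

```python
import re

def preprocess_tweet(text: str) -> str:
    """Illustrative clean-up of a raw tweet; the patterns are assumptions."""
    text = text.lower()                               # unify character case
    text = re.sub(r"@\w+", " ", text)                 # drop user name tags
    text = re.sub(r"https?://\S+", " ", text)         # drop URLs
    text = re.sub(r"\d+", " ", text)                  # drop numbers
    text = re.sub(r"([aeiou])\1{2,}", r"\1\1", text)  # cap vowel runs at two
    text = re.sub(r"\s+", " ", text).strip()          # collapse multiple spaces
    return text                                       # apostrophes are kept

print(preprocess_tweet("@user Goooood     morning!!! 123"))  # prints "good morning!!!"
```

Each substitution corresponds to one of the noise sources discussed in this section; applying them in a fixed order keeps the clean-up deterministic.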
We have highlighted the main issues found in the tweet domain and our approach towards each of them:

  • Different case characters: All characters were lowercased so that case does not interfere in the identification process, since the same character in different cases would be treated as two different elements.

  • Numbers and emoticons: Since these characters appear equally in any language, they were removed.

  • Vowel repetitions: Vowel repetition is a common issue when dealing with chatspeak. Such repetitions could damage the algorithm's performance, so they were reduced to a maximum of two using regular expressions.

  • Multiple spaces: This is also a common issue when dealing with tweets. A regular expression collapses runs of multiple spaces into a single space character.

When working with N-grams, it is important to observe that not all special characters should be removed from the text, since some can help the identification process. Characters like the apostrophe are more likely to appear in English and Catalan than in other languages; therefore this kind of special character must not be treated as noise, and we keep them for a better result.

2.2    N-gram distribution

To classify tweets into languages using N-grams, we have to extract a meaningful distribution for each language. To do so, we created documents of concatenated tweets for each language: English, Spanish, Catalan, Portuguese, Galician, Basque, other and undetermined. Tweets with mixed labels such as 'en+es', as well as those with ambiguous labels such as 'en/es', are added to both languages they contain (in this case to both Spanish 'es' and English 'en'). We then extract N-gram distributions in a dynamic way, so that we can choose the order N we wish.

2.3    Algorithms

Once we have an N-gram distribution for each language, given a new tweet to classify, we find the most likely language by extracting the tweet's N-gram distribution and comparing it with the language distributions. To do so, we took two different approaches.

2.3.1    Linear interpolation

The first method tries to find the probability of a sentence being generated by each language, by multiplying the probabilities of the consecutive N-grams of the sentence in the respective language. The problem appears when we deal with a small, finite dataset: there are not enough instances to reliably estimate the probabilities; in other words, the sparse data problem appears. This means that if the corpus of a certain language does not contain a certain N-gram, a sentence containing it would automatically have a probability of zero.

To avoid this problem in the computation of the probabilities of each tweet under the languages of our N-gram distribution, we use the linear interpolation smoothing method, also known as Jelinek-Mercer smoothing (Jelinek, 1997; Huang, Acero, and Hon, 2001). To use this smoothing method we have to compute the λ values from our N-gram corpus, the one generated from the 14991 training tweets. We use a dynamic program to compute as many λ values as the number of N-gram orders extracted from the training set. For instance, if we consider distributions up to 5-grams for English, we compute 5 λ's, one λ_i for each i-gram with i ∈ {1, ..., 5}. The probability of an N-gram is computed as follows:

    P(t_n | t_1, ..., t_{n-1}) = \sum_{i=1}^{n} λ_i \hat{P}(t_n | t_{n-i+1}, ..., t_{n-1})    (1)

for any n, where the \hat{P} are maximum likelihood estimates of the probabilities and \sum_{i=1}^{n} λ_i = 1, so that P represents a probability distribution.

The values of the λ's are computed by deleted interpolation (Brants, 2000). This technique successively removes each max-gram (biggest n-gram) from the training corpus and estimates the best values for the λ's from all the other n-grams in the corpus, adding confidence to the λ of the proportionally most-seen N-gram order.
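To make equation 1 concrete, the following is a minimal sketch of the interpolated probability once the λ values have been estimated. Function and variable names are ours, not the system's, and the toy character-level corpus is purely illustrative.

```python
from collections import Counter

def interpolated_prob(context, token, counts, lambdas):
    """Interpolated probability P(token | context) of equation 1: a
    lambda-weighted sum of maximum likelihood estimates of increasing
    order. counts[k] maps k-gram tuples to their corpus counts."""
    total = sum(counts[1].values())                  # corpus size (unigram denominator)
    prob = lambdas[0] * counts[1][(token,)] / total  # i = 1: unigram estimate
    for i in range(2, len(lambdas) + 1):             # the i-gram conditions on i-1 tokens
        hist = tuple(context[-(i - 1):])
        denom = counts[i - 1][hist]
        if denom:                                    # skip unseen histories
            prob += lambdas[i - 1] * counts[i][hist + (token,)] / denom
    return prob

# Toy character-level example over the string "abab" with orders 1 to 3.
corpus = "abab"
counts = {k: Counter(tuple(corpus[i:i + k]) for i in range(len(corpus) - k + 1))
          for k in (1, 2, 3)}
p = interpolated_prob(("a", "b"), "a", counts, [0.1, 0.3, 0.6])  # lambdas sum to 1
```

Because the unigram term is always defined, the interpolated probability never collapses to zero for an unseen higher-order N-gram, which is exactly the sparse-data fix described above.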
The algorithm is given in Algorithm 1.

    set λ1 = λ2 = ... = λn = 0;
    foreach max-gram (t1, ..., tn) with count(t1, ..., tn) > 0 do
        depending on the maximum of the following values:
        case (count(tn) − 1) / (N − 1):
            increment λ1 by count(t1, ..., tn)
        case (count(tn−1, tn) − 1) / (count(tn−1) − 1):
            increment λ2 by count(t1, ..., tn)
        case (count(tn−2, tn−1, tn) − 1) / (count(tn−2, tn−1) − 1):
            increment λ3 by count(t1, ..., tn)
        ...
        case (count(t1, ..., tn) − 1) / (count(t1, ..., tn−1) − 1):
            increment λn by count(t1, ..., tn)
    end

 Algorithm 1: Deleted interpolation algorithm, where N is the total number of tokens in the training corpus.

2.3.2    Out-of-place measure

For this next method, for every n we only consider a ranking list of n-grams ordered from most to least frequent, in which only the order is preserved rather than the exact frequencies. We decided to do this because, when comparing a single tweet (a document of at most 140 characters) to the distribution of each language, we cannot expect the frequency distribution of the tweet's n-grams to resemble that of the concatenated language document. We can, however, say that the most frequent n-grams have a higher probability of appearing, though not necessarily with frequencies proportional to those in the document. For this reason, we used the out-of-place measure.

We decided to submit this method as unconstrained because two of the parameters we used, discussed below, were extracted from previous work of ours on a self-downloaded corpus of tweets in different languages. We did this because finding new values would have taken too long, given the huge search space.

This measure is a distance which tells us approximately how far the tweet is from a language for a fixed n-gram order. Given the tweet ranking {T_i^n}_i and a language n-gram ranking {L_j^n}_j, the distance is computed as the sum of the number of positions by which each element of the tweet ranking has been displaced in the language ranking. That is, we sum |i − j| for every T_i^n in the tweet that is equal to some L_j^n. If an element of {T_i^n}_i does not appear in {L_j^n}_j, we assume the best case, i.e. that the missing element sits at the bottom of the language list; as we will discuss in section 4, this might not have been such a good idea. Finally, to be able to compare different distances, we normalize the out-of-place measure as:

    outOfPlaceMeasure / (length({T_i^n}_i) × length({L_j^n}_j))    (2)

As we can see in Figure 1, the out-of-place measure is calculated for a tweet from an English dataset. The m and n parameters give the maximum number of elements we allow in each list, so that computation time is not compromised by an unnecessary search over all the n-grams of a language (Cavnar, Trenkle, et al., 1994). This is the part of the algorithm that makes it unconstrained, since the parameters came from a previous, similar project of ours on self-downloaded tweets, where we found that the values m = 80 and n = 50 worked best. To avoid possible divisions by zero in equation 2, given that tweet lengths are sometimes zero or very close to it (especially after the removal of HTML entities, punctuation, etc.), we suppose that a tweet with fewer than three characters is undetermined. Again, this is a bold assumption which needs to be refined in future work.

Figure 1: Example of an out-of-place measure.
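The normalized out-of-place distance of equation 2 can be sketched as follows. Names are ours, and the handling of missing elements follows the best-case assumption described above.

```python
def out_of_place(tweet_ranking, lang_ranking):
    """Normalized out-of-place distance between two n-gram rankings.

    Tweet n-grams missing from the language ranking are assumed to sit
    at the bottom of the language list (the best-case assumption)."""
    lang_pos = {g: j for j, g in enumerate(lang_ranking)}
    bottom = len(lang_ranking) - 1                    # index of the last element
    measure = sum(abs(i - lang_pos.get(g, bottom))    # displacement of each n-gram
                  for i, g in enumerate(tweet_ranking))
    return measure / (len(tweet_ranking) * len(lang_ranking))  # equation 2

# Identical rankings are at distance 0; swapping two items gives 0.5.
assert out_of_place(["th", "he", "an"], ["th", "he", "an"]) == 0.0
assert out_of_place(["th", "he"], ["he", "th"]) == 0.5
```

In practice the two rankings would be truncated to the m and n most frequent entries before the distance is computed, which is what keeps the search tractable.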
Finally, in the training process, we reward each n-gram order if it correctly guessed a tweet. So if, for example, the trigrams label a tweet correctly but the unigrams and bigrams do not, we reward the trigrams with one point while the others get none. We do this with all the tweets in the training set, and in the end we obtain a reliability frequency for each n-gram order. When a tweet is tested, a weighted vote is taken using these confidence parameters, so that the language with the most reliability-weighted votes wins.

3    Setup and evaluation

The official results of our approach are the following. In the constrained category, using the linear interpolation algorithm (section 2.3.1), we obtained a precision of 0.777, a recall of 0.719 and an F-measure of 0.736.

In the unconstrained category we used the out-of-place measure algorithm (section 2.3.2) and obtained the following results: a precision of 0.598, a recall of 0.625 and an F-measure of 0.578.

3.1    Empirical settings

Before submitting the final results, we made different runs with different maximum N-gram orders to find the one with the best results. Also, because of the ambiguity of tweets with more than one language (for instance es+en), we take the average of the probabilities over all the languages and then create a threshold. For the linear interpolation we used:

    Threshold = (maxProbability − Average) / α    (3)

where maxProbability refers to the maximum of the probabilities over the languages and α > 0 is a restriction value that tolerates more or fewer suggested languages. The bigger the α, the less tolerance to ambiguity in the predicted languages for each tweet, yet the more precise the result; the smaller the α, the higher the recall, yet the lower the precision.

For the ranking-based method, the threshold is chosen by running a search from 0 to 0.3 with intervals of 0.05. The optimum found on the data set is 0.05.

3.2    Empirical evaluation

We ran experiments with different N-gram orders, from 1 to 8, and we set the α value to 10, which gave us the best results on the validation set.

In figure 2 we can see the results of the experiments made with the linear interpolation method. We can observe that the results improve as the N-gram order grows, but the peak is reached with the 5-gram; from there on, the results get slightly worse with each additional order.

Figure 2: Results obtained for the training set with the linear interpolation method.

Because of these results, we decided to submit the 5-gram results for the test set of the SEPLN 2014 task.

In the case of the ranking-based method, we do not have to test different n-gram combinations, since we obtain a reliability value for each n-gram order. So if a certain n-gram order were systematically wrong, it would have a very low confidence, which would make it barely influential. Finally, for computational reasons, we decided to use only 6 n-gram orders.

4    Conclusions and future work

In this paper we have described our approach to the SEPLN 2014 shared task of tweet language identification (TweetLID). Our system is based on a pre-processing stage that takes into account the accented characters that can appear in different languages, keeping them in the N-gram distribution stage instead of erasing them.

We also have two different algorithms: linear interpolation smoothing and the out-of-place measure. These algorithms obtain an F-measure of 0.736 and 0.578, respectively, on the given test corpus of 19993 tweets.
Our system ranked 3rd among the participants of the constrained track, using the linear interpolation algorithm, and 6th in the unconstrained track, using the out-of-place measure.

Among the mistakes we made was to underestimate the role of numerical digits in languages, which we removed. In English, numbers are often used to shorten text, so removing them makes us lose a great part of some words; for example, "to forgive someone" might be written as '2 4give som1'. This is true of many emerging internet alphabets, such as Arabizi (the Arabic chat alphabet).

As possible future work for the ranking-based method, it might be interesting to consider the distribution of word lengths in each language, since it can be a very discriminative characteristic. Also, in this method, the out-of-place measure should have penalized more severely the elements that do not appear in the document list, instead of supposing they could be found at the last element of the list.

Finally, we have to stress the importance of the pre-processing of tweets as one of the key parts of the project.

References

Brants, Thorsten. 2000. TnT: A statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing, ANLC '00, pages 224–231, Stroudsburg, PA, USA. Association for Computational Linguistics.

Carter, Simon, Wouter Weerkamp, and Manos Tsagkias. 2013. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 47(1):195–215, March.

Cavnar, William B., John M. Trenkle, et al. 1994. N-gram-based text categorization. Ann Arbor MI, 48113(2):161–175.

Huang, Xuedong, Alex Acero, and Hsiao-Wuen Hon. 2001. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition.

Jelinek, Frederick. 1997. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, USA.

San Vicente, Iñaki, Arkaitz Zubiaga, Pablo Gamallo, José Ramom Pichel, Iñaki Alegria, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno. 2014. Overview of TweetLID: Tweet language identification at SEPLN 2014. In TweetLID @ SEPLN 2014.