<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TweetSafa: Tweet language identi cation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Iosu Mendizabal Jeroni Carandell</string-name>
          <email>iosu@iiia.csic.es</email>
          <email>jeroni.carandell@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Horowitz</string-name>
          <email>daniel.horowitzzz@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>(IIIA) Arti cial Intelligence (UPC) Universitat Polit A</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the methodology used for the SEPLN 14 shared task of tweet language identi cation (TweetLID), as explained on (In~aki San Vicente, 2014). The system consists of 3 stages: pre-processing of tweets, creation of a dictionary of n-grams, and two algorithms ultimately used for language identi cation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Language identi cation is vital as a
preliminary step of any natural language processing
application. The increasing use of social
networks as an information exchange media is
making of them a very important
information center. Twitter has become one of the
most powerful information exchange
mechanisms and every day millions of users upload
tons of tweets.</p>
      <p>
        The SEPLN 2014 TweetLID task focuses
on the automatic identi cation of the
language in which tweets are written, as the
identi cation of tweet language is arousing
an increasing interest in the scienti c
community
        <xref ref-type="bibr" rid="ref2">(Carter, Weerkamp, and Tsagkias,
2013)</xref>
        . Identifying the language will help to
apply NLP techniques subsequently on the
tweet such as machine translation, sentiment
analysis, information extraction, etc.
Accurately identifying the language will facilitate
the application of resources suitable to the
language in question.
      </p>
      <p>The scope of this task will focus on the
top ve languages of the Iberian
Peninsula:Spanish, Portuguese, Catalan, Basque,
and Galician as well as English. These
languages are likely to co-occur along with
many news and events relevant to the Iberian
Peninsula, and thus an accurate identi
cation of the language is key to make sure that
we use the appropriate resources for the
linguistic processing.</p>
      <p>The rest of the article is laid out as
follows: Section 2 introduces the architecture
and components of the system: the
preprocessing state where the tweets are adapted
to a better comprehension for our algorithm
and the used algorithms. Afterwards, section
3 describes our results for the given problem.
To conclude, in section 4 we will try to draw
some conclusions and propose future works.
2</p>
      <p>Architecture and components of
the system
We have presented two di erent approaches
to the problems which have been presented in
track one (constrained) and track two
(unconstrained). Both of these methods share
great part of the process in terms of the set
of tweets being used to learn from, as well
as the way incoming tweets are preprocessed
and learned.
2.1</p>
    </sec>
    <sec id="sec-2">
      <title>Pre-processing</title>
      <p>The rst step of this process, is to identify
the noise present in all tweets regardless of
the language. There are common issues
related to regular text, such as multiple space
characters, but also speci c Twitter tokens
like the user name tag or emoticons. After
identifying this issues, we are able to remove
them using mostly regular expressions. We
have highlighted the main issues found in the
tweet domain and what our approach was
towards it:</p>
      <p>Di erent case characters: All characters
were lowercased, so they wouldn't
interfere in the identi cation process, since
the same character with di erent cases
is treated as two di erent elements.</p>
      <p>Numbers, Emoticons: Since these kind
of characters are presented equally in
any language, they have been removed.
Vowel repetitions: The vowel
repetition is a common issue when dealing
with chatspeak. These kind repetitions
could damage the algorithm's
performance, therefore they were completely
removed and reduced to a maximum of
two from the text using regular
expressions.</p>
      <p>Multiple spaces: This is also a common
issue when dealing with tweets. The
regular expression formats the text from
multiple spaces into one space character.
When working with N-grams, it is important
to observe that not all special characters are
to be removed from the text, since they could
interfere in the identi cation process.
Characters like the apostrophe, are more likely to
appear in English and Catalan than in
others, therefore this kind of special characters
must not be considered as noise, and we save
them for a better result.
2.2</p>
    </sec>
    <sec id="sec-3">
      <title>N-GRAM distribution</title>
      <p>To classify the tweets into languages using
N-grams we have to extract meaningful
distributions from each language. To do so,
we created documents of concatenated tweets
for each language: English, Spanish,
Catalan, Portuguese, Galician, Basque, other and
undetermined. Mixed labelled tweets such
as the ones with 'en+es' as well as those
with ambiguous languages 'en/es' are added
to both languages they contain (in this case
to both Spanish 'es' and English 'en'). Then
we extract N-gram distributions in a dynamic
way so that we can choose the number of N
we wish.
2.3</p>
    </sec>
    <sec id="sec-4">
      <title>Algorithms</title>
      <p>Once we have N-gram distributions for each
language, given a new tweet we want to
classify we are going to nd the most possible
language by extracting the tweet's N-gram
distribution and comparing it with the
languages distributions. To do so, we took two
di erent approaches.</p>
      <sec id="sec-4-1">
        <title>2.3.1 Linear Interpolation</title>
        <p>The rst method tries to nd out what the
probability is of a sentence being generated
by each language by multiplying the
probability of the consecutive N-Grams of the
sentence in their respective languages. The
problem appears when we deal with a small
nite dataset and there are therefore not
enough instances to reliably estimate the
probability, in other words, the sparse data
problem appears. This means that if a corpus
of a certain language does not have a certain
N-Gram, a sentence with the latter would
automatically have a probability of zero.</p>
        <p>
          To avoid this problem in the computation
of the probabilities of each tweet for the
languages of our N-Gram distribution we use the
linear interpolation smoothing method, also
known as the Jelinek-Mercer smoothing
          <xref ref-type="bibr" rid="ref6">(Jelinek, 1997)</xref>
          ,
          <xref ref-type="bibr" rid="ref4">(Huang, Acero, and Hon, 2001)</xref>
          .
To be able to use this smoothing method
we have to make a computation with our
N-Gram corpus, the one generated with the
14991 tweets for the training purpose, to
calculate the values. We create a dynamic
program to compute as many values as the
N-Grams we extracted from the training set.
For instance, if we consider up to 5 N-Gram
distributions for English we will compute 5
's for each N-Gram up to 5, so all i
corresponding to the i-Gram where i 2 f1; :::; 5g.
The probability of an N-Gram will be
computed as follows:
P(tnjt1; t2; :::; tn 1) =
(1)
n
P iP^(tn j
i=1
i 1
T tn j )
j=1
For any n and where P^ are maximum
likelihood estimates of the probabilities and
n
P i = 1, so P represent probability
distrii=1
butions.
        </p>
        <p>
          The values of are computed by deleted
interpolation
          <xref ref-type="bibr" rid="ref1">(Brants, 2000)</xref>
          . This technique
successively removes each max-gram (biggest
n-gram) from the training corpus and
estimates the best values for the 's from all
other n-grams in the corpus by adding a
condence to the lambdas for the most
proportionally seen N-Gram . The algorithm is
given in Algorithm 1.
        </p>
        <p>set 1 = 2 = 3 = ::: =
n = 0;
foreach MAX-Gram (t1; :::; tn) with
count (t1; :::; tn) &gt; 0 do
depending on the maximum of the
next values:
case count(tn) 1 : increment 1 by</p>
        <p>N 1
count(t1; :::; tn)
end
end
end
case cocuonutn(tt(ntn1;1t)n) 1 1 : increment 2
by count(t1; :::; tn)
case cocuonutn(ttn(tn2;t2n;tn1;t1n)))1 1 : increment
3 by count(t1; :::; tn)
end</p>
        <p>:::
count(t1;:::;tn) 1
case count(t1;t2;:::;tn 1) 1 : increment</p>
        <p>n by count(t1; :::; tn)
end
Algorithm 1: Deleted interpolation
Algorithm.</p>
      </sec>
      <sec id="sec-4-2">
        <title>2.3.2 Out-of-place measure</title>
        <p>For this next method, for every n we will only
consider a ranking list of n-grams ordered by
most to least frequent and where only the
order is preserved as opposed to the exact
frequencies. We decided to do this because
when it comes to comparing a single tweet
(documents of only 140 characters) to
distributions of each language, we cannot consider
that the frequency distribution of the tweet
n-grams will resemble the ones in the
concatenated document. We can however, say
that the most frequent have a higher
probability of appearance, but not necessarily with
proportional frequencies as in the document.
For this reason, we used the out-of-place
measure.</p>
        <p>We decided to send this method as
unconstrained because two of the parameters which
we used, that will be discussed later on, were
extracted from a previous work we did with
a self downloaded corpus of tweets of di
erent languages. We did this because it would
take too long if we had to nd the new values
because of the huge search space.</p>
        <p>This measure is a distance which will tell
us approximately how far the tweet is from a
language for a xed n-gram. Given the tweet
ranking fTingi and a language L n-gram
ranking fLjngj , the distance is computed by the
sum of the number of indexes that an element
of T has been displaced in list D. So we sum
j i j j for every Tin in the tweet that is equal
to Ln. In the case that an element in fTingi
j
does not exist in list fLjngj , we suppose the
best case, i.e. that the non appearing
element is in the bottom of the list. This as we
will discuss in section 4 might not have been
such a good idea. Finally, to be able to
compare di erent distances we need some kind of
proportion of the out-of-place measure that
we describe as:</p>
        <p>
          outOf P laceM easure
length(Tini) length(Ljnj )
(2)
As we can see in Figure 1, the out-of-place
measure is calculated for a tweet from an
English dataset. The m and n parameters give
us the maximum number of elements we
allow for each list so that computational time
does not get compromised by an unnecessary
whole search of all the n-grams in a language
          <xref ref-type="bibr" rid="ref3">(Cavnar, Trenkle, and others, 1994)</xref>
          . This
is the part of this algorithm that makes it
unconstrained, since the parameters we used
came from a previous similar project we did
using self downloaded tweets and where we
found that the values of m=80 and n=50
were best. To avoid possible divisions by
zero in equation 2, given that tweets are
sometimes zero or very close (especially
after the cleaning of html's, punctuation, etc),
we supose that if the number of characters
is smaller than three, the tweet is
undetermined. Again, a bold a rmation which needs
to be ne-touched in future work.
        </p>
        <p>Finally, in the training process, we are
going to reward each n-gram if it correctly
guessed a tweet. So if for example, a trigram
labels a tweet correctly but the unigrams and
bigrams do not, we reward the trigrams with
one point where the others don not get any.
We do this with all the tweets in the
training set and in the end we get frequency of
reliability of each n-gram. When the test is
done on a tweet, a weighted voting is done
using these con dence parameters so that the
most voted languages counting the reliability
weight wins.
3</p>
        <p>Setup and evaluation
The o cial result of our approach are the
next ones: In the constrained category
using the linear interpolation algorithm, section
2.3.1, we obtained a precision of 0.777, a
recall of 0.719 and a F-measure of 0.736.</p>
        <p>In the unconstrained category we used the
out-of-place measure algorithm, section 2.3.2,
and obtained the next results: precision of
0.598, recall of 0.625 and F-Measure of 0.578.
3.1</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Empirical settings</title>
      <p>Before submitting the nal results we made
di erent executions with di erent maximum
N-Grams to know which was the one with the
best results. Also because of the ambiguity
of tweets with more than one language, for
instance es+en, to compute this we take
average value of all the probabilities of all the
languages and then create a threshold. For
the linear interpolation we used:
maxP robability</p>
      <p>Average
T hreshold =
(3)
Where the maxP robability refers to the
maximum of the probabilities of the languages
and &lt; 0 is the value of restriction that
tolerates more or less the number of languages
that may be suggested. The bigger the ,
the less tolerance to ambiguity of predicted
languages for each tweet yet the more precise
the result, while the smaller the alpha, the
higher the recall yet smaller the precision.</p>
      <p>For the ranking-based method, the
threshold is chosen by running a search from 0 to
0.3 with intervals of 0.05. The most optimum
found on the data set is 0.05.
3.2</p>
    </sec>
    <sec id="sec-6">
      <title>Empirical evaluation</title>
      <p>We ran experiments with di erent N-Gram
values, from 1 to 8, and we set the value
to 10 which gave us the best results in the
validation set.</p>
      <p>In gure 2 we can see the results of the
experiments we made using the linear
interpolation method. We can observe how the
results are going better while the N-Grams
are going bigger, but the peak of the results
are achieved with the 5-gram, from there on
the results are slightly worst each gram we
sum.</p>
      <p>Because of these results, we decided to
send the 5-gram results for the test set given
for the SEPLN 2014 task.</p>
      <p>In the case of the ranking based method,
we do not have to test di erent n-gram
combinations since we obtain a reliability for each
n-gram to be truthful. So if a certain n-gram
were systematically wrong it would have a
very low con dence which would not make it
so in uential. Finally we decided for
computational reasons to use only 6 n-grams.
4</p>
      <p>Conclusions and future work
In this paper we have described our approach
for the SEPLN 2014 shared task of tweet
language identi cation (TweetLID). Our system
is based on a pre-processing part taking into
account the di erent accents can appear in
di erent languages using language codi
cations in the N-Gram distribution state
without erasing them.</p>
      <p>Also we have two di erent algorithms the
linear interpolation smoothing and the
outof-place measure. These algorithms obtain
an F-measure of 0.736 and 0.578 respectively
in the given test corpus of 19993 tweets. Our
system ranked in the 3rd best place among
the participants of the constrained track,
using the linear interpolation algorithm, and
6th in the unconstrained track, using the
outof-place measure.</p>
      <p>Among the mistakes we made was to
underestimate numerical digits in languages,
which we removed. In the English language,
numbers are often used to shorten text, thus
making us lose great part of words for
example; "to forgive someone" might be written
as: '2 4give som1'. This is true in many
internet alphabets which are emerging such as
Arabizi(the arabic chat language).</p>
      <p>For possible future work for the
rankingbased method it might be interesting to
consider the distribution of the length of words in
each language since it can be a very
determining characteristic. Also in this method, the
out of place measure should have penalized
more severely the non non-appearing
characters in the document list instead of supposing
it could be found on the last element of the
list.</p>
      <p>Finally we have to stress the importance
the pre-processing of tweets as one of the key
parts in the project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Brants</surname>
          </string-name>
          , Thorsten.
          <year>2000</year>
          .
          <article-title>Tnt: A statistical part-of-speech tagger</article-title>
          .
          <source>In Proceedings of the Sixth Conference on Applied Natural Language Processing, ANLC '00</source>
          , pages
          <fpage>224</fpage>
          {
          <fpage>231</fpage>
          ,
          <string-name>
            <surname>Stroudsburg</surname>
          </string-name>
          , PA, USA. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Carter</surname>
            , Simon,
            <given-names>Wouter</given-names>
          </string-name>
          <string-name>
            <surname>Weerkamp</surname>
            , and
            <given-names>Manos</given-names>
          </string-name>
          <string-name>
            <surname>Tsagkias</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Microblog language identi cation: Overcoming the limitations of short, unedited and idiomatic text</article-title>
          .
          <source>Lang</source>
          . Resour. Eval.,
          <volume>47</volume>
          (
          <issue>1</issue>
          ):
          <volume>195</volume>
          {
          <fpage>215</fpage>
          ,
          <string-name>
            <surname>March</surname>
          </string-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Cavnar</surname>
          </string-name>
          ,
          <string-name>
            <surname>William</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>John M Trenkle</surname>
          </string-name>
          , et al.
          <year>1994</year>
          .
          <article-title>N-gram-based text categorization</article-title>
          . Ann Arbor MI,
          <volume>48113</volume>
          (
          <issue>2</issue>
          ):
          <volume>161</volume>
          {
          <fpage>175</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Huang</surname>
            , Xuedong,
            <given-names>Alex</given-names>
          </string-name>
          <string-name>
            <surname>Acero</surname>
            , and
            <given-names>HsiaoWuen</given-names>
          </string-name>
          <string-name>
            <surname>Hon</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Spoken Language Processing: A Guide to Theory, Algorithm, and</article-title>
          <string-name>
            <given-names>System</given-names>
            <surname>Development. Prentice Hall</surname>
          </string-name>
          <string-name>
            <surname>PTR</surname>
          </string-name>
          , Upper Saddle River, NJ, USA, 1st edition.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>In~aki San Vicente</source>
          , Arkaitz Zubiaga, Pablo Gamallo Jose Ramom Pichel In~aki Alegria Nora Aranberri Aitzol Ezeiza VA~ - ctor
          <string-name>
            <surname>Fresno</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Overview of tweetlid: Tweet language identi cation at sepln 2014</article-title>
          . In In TweetLID @ SEPLN
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Jelinek</surname>
          </string-name>
          , Frederick.
          <year>1997</year>
          .
          <article-title>Statistical Methods for Speech Recognition</article-title>
          . MIT Press, Cambridge, MA, USA.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>