=Paper= {{Paper |id=Vol-1228/paper3 |storemode=property |title=TweetSafa: Tweet Language Identification |pdfUrl=https://ceur-ws.org/Vol-1228/tweetlid-3-mendizabal.pdf |volume=Vol-1228 |dblpUrl=https://dblp.org/rec/conf/sepln/MendizabalCH14 }} ==TweetSafa: Tweet Language Identification== https://ceur-ws.org/Vol-1228/tweetlid-3-mendizabal.pdf
                  TweetSafa: Tweet language identification
                  TweetSafa: Identificación del lenguaje de tweets

    Iosu Mendizabal
    Artificial Intelligence Research Institute (IIIA)
    Spanish Council for Scientific Research (CSIC)
    iosu@iiia.csic.es

    Jeroni Carandell & Daniel Horowitz
    Universitat Politècnica de Catalunya (UPC)
    Universitat Rovira i Virgili (URV)
    Universitat de Barcelona (UB)
    jeroni.carandell@gmail.com
    daniel.horowitzzz@gmail.com

    Resumen: Este artículo describe la metodología utilizada en la tarea propuesta en SEPLN 14
    para la identificación del lenguaje de tweets (TweetLID), como se explica en (San Vicente
    et al., 2014). El sistema consta de un preprocesamiento de tweets, creación de diccionarios
    a partir de N-gramas y dos algoritmos de reconocimiento de lenguaje.
    Palabras clave: Reconocimiento de lenguaje, lenguaje de tweets.
    Abstract: This paper describes the methodology used for the SEPLN 14 shared task of
    tweet language identification (TweetLID), as explained in (San Vicente et al., 2014). The
    system consists of three stages: pre-processing of the tweets, creation of a dictionary of
    N-grams, and two algorithms ultimately used for language identification.
    Keywords: Language identification, tweet language.
1    Introduction and objectives

Language identification is vital as a preliminary step of any natural language processing application. The increasing use of social networks as a medium of information exchange is turning them into very important information centers. Twitter has become one of the most powerful information exchange mechanisms, and every day millions of users upload millions of tweets.

The SEPLN 2014 TweetLID task focuses on the automatic identification of the language in which tweets are written, as tweet language identification is attracting increasing interest in the scientific community (Carter, Weerkamp, and Tsagkias, 2013). Identifying the language helps to subsequently apply NLP techniques to the tweet, such as machine translation, sentiment analysis, information extraction, etc. Accurate language identification facilitates the application of resources suited to the language in question.

The scope of this task covers the top five languages of the Iberian Peninsula: Spanish, Portuguese, Catalan, Basque and Galician, as well as English. These languages are likely to co-occur in news and events relevant to the Iberian Peninsula, and thus accurate identification of the language is key to making sure that we use the appropriate resources for the linguistic processing.

The rest of the article is laid out as follows. Section 2 introduces the architecture and components of the system: the pre-processing stage, where tweets are adapted so that our algorithms can handle them better, and the algorithms themselves. Afterwards, section 3 describes our results for the given problem. To conclude, in section 4 we draw some conclusions and propose future work.

2    Architecture and components of the system

We present two different approaches to the problems posed in track one (constrained) and track two (unconstrained). Both methods share a large part of the pipeline, in terms of the set of tweets used for learning as well as the way incoming tweets are pre-processed and learned.

2.1    Pre-processing

The first step of the process is to identify the noise present in all tweets regardless of the language. There are common issues related to regular text, such as multiple space characters, but also Twitter-specific tokens like user name tags or emoticons. After identifying these issues, we are able to remove them, mostly using regular expressions.
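This kind of regex-based clean-up can be sketched as follows. The patterns and the function name are our illustrative assumptions, not the system's actual code (emoticon stripping, for instance, is omitted here); note that apostrophes are deliberately left untouched.

```python
import re

def preprocess_tweet(text: str) -> str:
    """Illustrative clean-up of a raw tweet; the patterns are assumptions."""
    text = text.lower()                               # unify character case
    text = re.sub(r"@\w+", " ", text)                 # drop user name tags
    text = re.sub(r"https?://\S+", " ", text)         # drop URLs
    text = re.sub(r"\d+", " ", text)                  # drop numbers
    text = re.sub(r"([aeiou])\1{2,}", r"\1\1", text)  # cap vowel runs at two
    text = re.sub(r"\s+", " ", text).strip()          # collapse multiple spaces
    return text                                       # apostrophes are kept

print(preprocess_tweet("@user Goooood     morning!!! 123"))  # prints "good morning!!!"
```

Each substitution corresponds to one of the noise sources discussed in this section; applying them in a fixed order keeps the clean-up deterministic.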
We have highlighted the main issues found in the tweet domain and our approach towards each of them:

  • Different case characters: All characters were lowercased so that case does not interfere in the identification process, since the same character in different cases would be treated as two different elements.

  • Numbers and emoticons: Since these characters appear equally in any language, they were removed.

  • Vowel repetitions: Vowel repetition is a common issue when dealing with chatspeak. Such repetitions could damage the algorithm's performance, so they were reduced to a maximum of two using regular expressions.

  • Multiple spaces: This is also a common issue when dealing with tweets. A regular expression collapses runs of multiple spaces into a single space character.

When working with N-grams, it is important to observe that not all special characters should be removed from the text, since some can help the identification process. Characters like the apostrophe are more likely to appear in English and Catalan than in other languages; therefore this kind of special character must not be treated as noise, and we keep them for a better result.

2.2    N-gram distribution

To classify tweets into languages using N-grams, we have to extract a meaningful distribution for each language. To do so, we created documents of concatenated tweets for each language: English, Spanish, Catalan, Portuguese, Galician, Basque, other and undetermined. Tweets with mixed labels such as 'en+es', as well as those with ambiguous labels such as 'en/es', are added to both languages they contain (in this case to both Spanish 'es' and English 'en'). We then extract N-gram distributions in a dynamic way, so that we can choose the order N we wish.

2.3    Algorithms

Once we have an N-gram distribution for each language, given a new tweet to classify, we find the most likely language by extracting the tweet's N-gram distribution and comparing it with the language distributions. To do so, we took two different approaches.

2.3.1    Linear interpolation

The first method tries to find the probability of a sentence being generated by each language, by multiplying the probabilities of the consecutive N-grams of the sentence in the respective language. The problem appears when we deal with a small, finite dataset: there are not enough instances to reliably estimate the probabilities; in other words, the sparse data problem appears. This means that if the corpus of a certain language does not contain a certain N-gram, a sentence containing it would automatically have a probability of zero.

To avoid this problem in the computation of the probabilities of each tweet under the languages of our N-gram distribution, we use the linear interpolation smoothing method, also known as Jelinek-Mercer smoothing (Jelinek, 1997; Huang, Acero, and Hon, 2001). To use this smoothing method we have to compute the λ values from our N-gram corpus, the one generated from the 14991 training tweets. We use a dynamic program to compute as many λ values as the number of N-gram orders extracted from the training set. For instance, if we consider distributions up to 5-grams for English, we compute 5 λ's, one λ_i for each i-gram with i ∈ {1, ..., 5}. The probability of an N-gram is computed as follows:

    P(t_n | t_1, ..., t_{n-1}) = \sum_{i=1}^{n} λ_i \hat{P}(t_n | t_{n-i+1}, ..., t_{n-1})    (1)

for any n, where the \hat{P} are maximum likelihood estimates of the probabilities and \sum_{i=1}^{n} λ_i = 1, so that P represents a probability distribution.

The values of the λ's are computed by deleted interpolation (Brants, 2000). This technique successively removes each max-gram (biggest n-gram) from the training corpus and estimates the best values for the λ's from all the other n-grams in the corpus, adding confidence to the λ of the proportionally most-seen N-gram order.
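To make equation 1 concrete, the following is a minimal sketch of the interpolated probability once the λ values have been estimated. Function and variable names are ours, not the system's, and the toy character-level corpus is purely illustrative.

```python
from collections import Counter

def interpolated_prob(context, token, counts, lambdas):
    """Interpolated probability P(token | context) of equation 1: a
    lambda-weighted sum of maximum likelihood estimates of increasing
    order. counts[k] maps k-gram tuples to their corpus counts."""
    total = sum(counts[1].values())                  # corpus size (unigram denominator)
    prob = lambdas[0] * counts[1][(token,)] / total  # i = 1: unigram estimate
    for i in range(2, len(lambdas) + 1):             # the i-gram conditions on i-1 tokens
        hist = tuple(context[-(i - 1):])
        denom = counts[i - 1][hist]
        if denom:                                    # skip unseen histories
            prob += lambdas[i - 1] * counts[i][hist + (token,)] / denom
    return prob

# Toy character-level example over the string "abab" with orders 1 to 3.
corpus = "abab"
counts = {k: Counter(tuple(corpus[i:i + k]) for i in range(len(corpus) - k + 1))
          for k in (1, 2, 3)}
p = interpolated_prob(("a", "b"), "a", counts, [0.1, 0.3, 0.6])  # lambdas sum to 1
```

Because the unigram term is always defined, the interpolated probability never collapses to zero for an unseen higher-order N-gram, which is exactly the sparse-data fix described above.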
The algorithm is given in Algorithm 1.

    set λ1 = λ2 = ... = λn = 0;
    foreach max-gram (t1, ..., tn) with count(t1, ..., tn) > 0 do
        depending on the maximum of the following values:
        case (count(tn) − 1) / (N − 1):
            increment λ1 by count(t1, ..., tn)
        case (count(tn−1, tn) − 1) / (count(tn−1) − 1):
            increment λ2 by count(t1, ..., tn)
        case (count(tn−2, tn−1, tn) − 1) / (count(tn−2, tn−1) − 1):
            increment λ3 by count(t1, ..., tn)
        ...
        case (count(t1, ..., tn) − 1) / (count(t1, ..., tn−1) − 1):
            increment λn by count(t1, ..., tn)
    end

 Algorithm 1: Deleted interpolation algorithm, where N is the total number of tokens in the training corpus.

2.3.2    Out-of-place measure

For this next method, for every n we only consider a ranking list of n-grams ordered from most to least frequent, in which only the order is preserved rather than the exact frequencies. We decided to do this because, when comparing a single tweet (a document of at most 140 characters) to the distribution of each language, we cannot expect the frequency distribution of the tweet's n-grams to resemble that of the concatenated language document. We can, however, say that the most frequent n-grams have a higher probability of appearing, though not necessarily with frequencies proportional to those in the document. For this reason, we used the out-of-place measure.

We decided to submit this method as unconstrained because two of the parameters we used, discussed below, were extracted from previous work of ours on a self-downloaded corpus of tweets in different languages. We did this because finding new values would have taken too long, given the huge search space.

This measure is a distance which tells us approximately how far the tweet is from a language for a fixed n-gram order. Given the tweet ranking {T_i^n}_i and a language n-gram ranking {L_j^n}_j, the distance is computed as the sum of the number of positions by which each element of the tweet ranking has been displaced in the language ranking. That is, we sum |i − j| for every T_i^n in the tweet that is equal to some L_j^n. If an element of {T_i^n}_i does not appear in {L_j^n}_j, we assume the best case, i.e. that the missing element sits at the bottom of the language list; as we will discuss in section 4, this might not have been such a good idea. Finally, to be able to compare different distances, we normalize the out-of-place measure as:

    outOfPlaceMeasure / (length({T_i^n}_i) × length({L_j^n}_j))    (2)

As we can see in Figure 1, the out-of-place measure is calculated for a tweet from an English dataset. The m and n parameters give the maximum number of elements we allow in each list, so that computation time is not compromised by an unnecessary search over all the n-grams of a language (Cavnar, Trenkle, et al., 1994). This is the part of the algorithm that makes it unconstrained, since the parameters came from a previous, similar project of ours on self-downloaded tweets, where we found that the values m = 80 and n = 50 worked best. To avoid possible divisions by zero in equation 2, given that tweet lengths are sometimes zero or very close to it (especially after the removal of HTML entities, punctuation, etc.), we suppose that a tweet with fewer than three characters is undetermined. Again, this is a bold assumption which needs to be refined in future work.

Figure 1: Example of an out-of-place measure.
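The normalized out-of-place distance of equation 2 can be sketched as follows. Names are ours, and the handling of missing elements follows the best-case assumption described above.

```python
def out_of_place(tweet_ranking, lang_ranking):
    """Normalized out-of-place distance between two n-gram rankings.

    Tweet n-grams missing from the language ranking are assumed to sit
    at the bottom of the language list (the best-case assumption)."""
    lang_pos = {g: j for j, g in enumerate(lang_ranking)}
    bottom = len(lang_ranking) - 1                    # index of the last element
    measure = sum(abs(i - lang_pos.get(g, bottom))    # displacement of each n-gram
                  for i, g in enumerate(tweet_ranking))
    return measure / (len(tweet_ranking) * len(lang_ranking))  # equation 2

# Identical rankings are at distance 0; swapping two items gives 0.5.
assert out_of_place(["th", "he", "an"], ["th", "he", "an"]) == 0.0
assert out_of_place(["th", "he"], ["he", "th"]) == 0.5
```

In practice the two rankings would be truncated to the m and n most frequent entries before the distance is computed, which is what keeps the search tractable.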
Finally, in the training process, we reward each n-gram order if it correctly guessed a tweet. So if, for example, the trigrams label a tweet correctly but the unigrams and bigrams do not, we reward the trigrams with one point while the others get none. We do this with all the tweets in the training set, and in the end we obtain a reliability frequency for each n-gram order. When a tweet is tested, a weighted vote is taken using these confidence parameters, so that the language with the most reliability-weighted votes wins.

3    Setup and evaluation

The official results of our approach are the following. In the constrained category, using the linear interpolation algorithm (section 2.3.1), we obtained a precision of 0.777, a recall of 0.719 and an F-measure of 0.736.

In the unconstrained category we used the out-of-place measure algorithm (section 2.3.2) and obtained the following results: a precision of 0.598, a recall of 0.625 and an F-measure of 0.578.

3.1    Empirical settings

Before submitting the final results, we made different runs with different maximum N-gram orders to find the one with the best results. Also, because of the ambiguity of tweets with more than one language (for instance es+en), we take the average of the probabilities over all the languages and then create a threshold. For the linear interpolation we used:

    Threshold = (maxProbability − Average) / α    (3)

where maxProbability refers to the maximum of the probabilities over the languages and α > 0 is a restriction value that tolerates more or fewer suggested languages. The bigger the α, the less tolerance to ambiguity in the predicted languages for each tweet, yet the more precise the result; the smaller the α, the higher the recall, yet the lower the precision.

For the ranking-based method, the threshold is chosen by running a search from 0 to 0.3 with intervals of 0.05. The optimum found on the data set is 0.05.

3.2    Empirical evaluation

We ran experiments with different N-gram orders, from 1 to 8, and we set the α value to 10, which gave us the best results on the validation set.

In figure 2 we can see the results of the experiments made with the linear interpolation method. We can observe that the results improve as the N-gram order grows, but the peak is reached with the 5-gram; from there on, the results get slightly worse with each additional order.

Figure 2: Results obtained for the training set with the linear interpolation method.

Because of these results, we decided to submit the 5-gram results for the test set of the SEPLN 2014 task.

In the case of the ranking-based method, we do not have to test different n-gram combinations, since we obtain a reliability value for each n-gram order. So if a certain n-gram order were systematically wrong, it would have a very low confidence, which would make it barely influential. Finally, for computational reasons, we decided to use only 6 n-gram orders.

4    Conclusions and future work

In this paper we have described our approach to the SEPLN 2014 shared task of tweet language identification (TweetLID). Our system is based on a pre-processing stage that takes into account the accented characters that can appear in different languages, keeping them in the N-gram distribution stage instead of erasing them.

We also have two different algorithms: linear interpolation smoothing and the out-of-place measure. These algorithms obtain an F-measure of 0.736 and 0.578, respectively, on the given test corpus of 19993 tweets.
Our system ranked 3rd among the participants of the constrained track, using the linear interpolation algorithm, and 6th in the unconstrained track, using the out-of-place measure.

Among the mistakes we made was to underestimate the role of numerical digits in languages, which we removed. In English, numbers are often used to shorten text, so removing them makes us lose a great part of some words; for example, "to forgive someone" might be written as '2 4give som1'. This is true of many emerging internet alphabets, such as Arabizi (the Arabic chat alphabet).

As possible future work for the ranking-based method, it might be interesting to consider the distribution of word lengths in each language, since it can be a very discriminative characteristic. Also, in this method, the out-of-place measure should have penalized more severely the elements that do not appear in the document list, instead of supposing they could be found at the last element of the list.

Finally, we have to stress the importance of the pre-processing of tweets as one of the key parts of the project.

References

Brants, Thorsten. 2000. TnT: A statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing, ANLC '00, pages 224–231, Stroudsburg, PA, USA. Association for Computational Linguistics.

Carter, Simon, Wouter Weerkamp, and Manos Tsagkias. 2013. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 47(1):195–215, March.

Cavnar, William B., John M. Trenkle, et al. 1994. N-gram-based text categorization. Ann Arbor MI, 48113(2):161–175.

Huang, Xuedong, Alex Acero, and Hsiao-Wuen Hon. 2001. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition.

Jelinek, Frederick. 1997. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, USA.

San Vicente, Iñaki, Arkaitz Zubiaga, Pablo Gamallo, José Ramom Pichel, Iñaki Alegria, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno. 2014. Overview of TweetLID: Tweet language identification at SEPLN 2014. In TweetLID @ SEPLN 2014.