=Paper= {{Paper |id=Vol-1445/tweetmt-1-gonzalez |storemode=property |title=An Analysis of Twitter Corpora and the Differences between Formal and Colloquial Tweets |pdfUrl=https://ceur-ws.org/Vol-1445/tweetmt-1-gonzalez.pdf |volume=Vol-1445 |dblpUrl=https://dblp.org/rec/conf/sepln/Gonzalez15 }} ==An Analysis of Twitter Corpora and the Differences between Formal and Colloquial Tweets== https://ceur-ws.org/Vol-1445/tweetmt-1-gonzalez.pdf
    An Analysis of Twitter Corpora and the Differences between
                   Formal and Colloquial Tweets∗
     Análisis de Varios Corpus de Twitter y las Diferencias entre Tweets
                            Formales y Coloquiales
                                    Meritxell Gonzàlez
                    Oxford University Press, Oxford, United Kingdom
                   Universitat Politècnica de Catalunya, Barcelona, Spain
                            meritxell.gonzalezbermudez@oup.com

      Abstract:        This work reviews recent publications addressing the Twitter
      translation task, and highlights the lack of appropriate corpora that represent the
      colloquial language used in Twitter. It also discusses the best-known issues in
      the Twitter genre: the use of hashtags and the number of out-of-vocabulary words
      (OOVs), with special focus on comparing formal and colloquial texts.
      Resumen:       Este trabajo resume las publicaciones recientes en el área de la
      traducción automática de tweets, destacando la falta de un corpus que represente
      el lenguaje coloquial presente en Twitter. También se tratan los problemas más
      conocidos del género de Twitter: el uso de hashtags y la gran cantidad de palabras
      OOV, con especial enfoque en las diferencias entre tweets formales y coloquiales.

      Keywords/Palabras clave: corpus, tweets, hashtags, and OOV
∗ This work was partially funded by the TACARDI project (TIN2012-38523-C02) of the Spanish
Ministerio de Economía y Competitividad.

1    Introduction
The success and increasing popularity of microblogging has raised the need to analyse
and process its content. Traditional methods for natural language processing fail when
applied over these texts. The reasons are neither few nor simple. Roughly, microblog
documents do not follow the traditional structure of a formal text or document; they use
a number of language variants, styles and registers, among other linguistic phenomena, and
can even include multimedia content as a means of communication (Jehl, 2010; Gotti et al.,
2014; Kaufmann and Kalita, 2010; Bertoldi et al., 2010).
   Machine Translation (MT) is a hard task within the natural language processing field.
It has received considerable attention during the last decades, and it is still an active
field with many research challenges. As in other natural language processing tasks, it
counts among its difficulties the ambiguity of the language, and the need for corpora and
a gold standard. The former can be addressed by analysing the context in which a sentence
occurs, while the second has typically been addressed by combining large amounts of
general purpose data and smaller subsets of domain specific datasets. The creation of a
gold standard in MT requires the use of parallel data that helps to assess the quality of
the output.
   When addressing automatic translation within the microblogging genre, one has to deal
with the additional difficulty of having little or no context and the fact that microblogs
exhibit fleeting domains. Twitter is no different from other microblogs and has, in
addition, its own particularities. As described in (Jehl, 2010), tweets actually share the
spontaneity and expressiveness of spoken language, but are limited to 140 characters. Due
to this constraint, tweets usually have a very simple syntax. However, they are riddled
with ungrammaticalities, misspellings and an unlimited number of lexical variants created
out of human imagination and the common ground shared by part of the audience.
   In this document, Section 2 summarises recent studies in this field and the different
approaches followed to address these phenomena. Next, Sections 3 to 5 give a numerical
analysis of 6 different corpora of tweets written in Basque, Catalan, and Spanish. The
goal of this analysis is to
sketch the content of Twitter messages (tweets), highlight their principal characteristics
and discuss the differences between formal and colloquial tweets.

2     Recent Work on Twitter Translation
The automatic translation of tweets is, in general, more difficult than regular MT.
Although the MT community has already addressed the translation of tweets, there are
still few works in this area, mainly because of the lack of corpora, and especially of
those showing a fair representation of colloquial texts. The number of authors publishing
content in multiple languages is not small, but their messages tend to be correct and
well structured, in contrast to those posted by the bulk of users.

2.1    Twitter Corpora
The availability of parallel corpora for Twitter is growing but still scarce. The
following four works gathered parallel data following diverse approaches, but all of them
contain formal texts only. (Gotti et al., 2013) gathered data from Canadian Government
agencies, written in French and English. This work describes an MT system that uses
in-domain parallel data crawled from the links appearing in the tweets. Hence, tuning was
conducted with documents from the same domain. The corpus built in (Ling et al., 2014)
contains tweets written in Chinese and English. This work describes a tool and a
methodology to help users to identify parallel excerpts in the messages and to annotate
their boundaries. The data obtained with this method was fairly cheap (crowd-sourced) and
turned out to be of high quality. (Jehl et al., 2012) used a corpus of Arabic sentences
that were manually translated into English. The data was crawled by filtering on the topic
(Arab Spring) and was cleaned and pruned, also by means of crowd-sourcing. Finally, the
shared task described in (Alegria et al., 2015) distributed a collection of parallel
corpora in the languages spoken in the Iberian peninsula. These corpora have been used in
this study and they are detailed in Section 3.
   In contrast, the following four works deal with the noisy input from colloquial texts,
but either they do not belong to the Twitter genre or they do not contain parallel data.
(Kaufmann and Kalita, 2010) describes an MT system able to translate from colloquial
English into standard English. The rationale is that traditional NLP techniques can be
applied over standardised text. Their methodology includes the use of aligned data from a
corpus of SMSs that contains the most common acronyms and short forms. (Bertoldi et al.,
2010) and (Formiga and Fonollosa, 2012) address the problem of translating noisy input:
the former by trying to simulate and generate noisy input automatically; the latter by
adding a preprocessing layer to convert the input into clean text. Finally, the corpus
described in (Alegria et al., 2014) was distributed to the participants of the TweetNorm
shared task. This is a monolingual corpus of Spanish tweets. Since this corpus has been
used in this study, it is further detailed in Section 3.

2.2   Linguistic Phenomena
Although the previous works addressed different problems, they share common ground on the
principal difficulties of the Twitter genre. First, the translation of hashtags is an open
issue that includes their segmentation, identification and the analysis of their role in
sentences (Gotti et al., 2014). Second, the correct tokenisation of the text is essential
but difficult due to the extreme noisiness of the text. Also, making the translation fit
in 140 characters can harm the quality of the output, although (Jehl, 2010) addressed this
issue in her thesis and reported good results.
   The increasing interest in the field has promoted the design of tools to create
specialised corpora. However, the human translation of tweets also raises open questions
(Šubert and Bojar, 2014). For instance, how to translate idioms and slang,
out-of-vocabulary words, onomatopoeias, emphases (jajaaaaa), or irony. But also, how to
approach the translation of hashtags and symbols (such as emoticons), how to
interpret wrong syntax, find the translated version of a link, and fit the final
translation into 140 characters, among others.
   All in all, creating a synthetic corpus to simulate these phenomena seems a feasible
approach (Bertoldi et al., 2010), yet it is out of the scope of this study. Last, but not
least, an appropriate methodology and measures to assess the quality of Twitter
translations, including their particular characteristics, have not been addressed so far.

3   Description of the Used Corpora
The next sections analyse six datasets of tweets from the Tweet-Norm (Alegria et al.,
2014), Tweet-MT (Alegria et al., 2015) and Social Media (Saurí, 2013) corpora. The goal is
to discuss a few of the phenomena mentioned in the previous section.
   A set of four datasets was obtained from the Tweet-MT corpora. It consists of 2 bitexts
for the Catalan–Spanish and Basque–Spanish language pairs. The four datasets contain both
the development and the test sets for each language: CAES.ca, CAES.es, EUES.eu and
EUES.es. The tweets in these datasets were obtained from a sample of manually selected
accounts of authors that tend to tweet in various languages, namely public organisations
and personalities. Hence, the content of the messages is mainly formal, i.e., they do not
contain misspellings and do not overuse symbols.
   The fifth dataset, TNORM, was obtained from the Tweet-Norm corpus, which gathered a
random selection of geolocated tweets within the Iberian peninsula, excluding multilingual
areas where languages other than Spanish are spoken. The corpus was processed to identify
and annotate out-of-vocabulary words. Hence, it contains not only correct messages, but
also colloquial ones. The dataset used in this work contains the two development sets and
the test set provided in the workshop.
   The last dataset used in this work is TSM. It is a portion of the Social Media Corpus,
and in particular the corpus of tweets in Spanish. It contains a general domain set of
randomly selected tweets. So, similarly to TNORM, it contains both formal and colloquial
tweets. They were manually processed to classify them according to the language of the
tweet and to annotate different layers such as communication function, polarity, target,
and topic. This process included some clean-up of the Twitter mark-up for privacy reasons.
Hence, the author id and user mentions, hashtags and URLs were substituted with the labels
@USER, #HASHTAG and [URL], respectively.
   The six datasets were processed to have similar characteristics: the tokens that
correspond to the author id and RT (re-tweet) were removed when present, and they were
tokenised using an adaptation to the Spanish and Catalan languages of the Twokenize tool
(O'Connor et al., 2010). Table 1 shows the number of tweets, the number of tokens and the
average number of tokens per tweet in each corpus. Regardless of the differences in nature
and size of the datasets, they show a similar number of tokens per tweet, with CAES.ca
being the dataset with the longest tweets and EUES.eu the one with the shortest. The
messages in the two colloquial corpora, TNORM and TSM, seem to be slightly shorter than
those in the formal datasets in the same language, CAES.es and EUES.es.

                          CAES.ca     CAES.es
  # tweets                  4,000       4,000
  # tokens                 66,559      66,113
  avg. tokens/tweet         16.39       16.53
                          EUES.eu     EUES.es
  # tweets                  4,000       4,000
  # tokens                 58,368      51,782
  avg. tokens/tweet         14.59       12.94
                            TNORM         TSM
  # tweets                  1,132       8,571
  # tokens                 14,497     123,679
  avg. tokens/tweet         12.80       14.43

Table 1: Statistics on number of tweets and tokens in each corpus.

   Although tweets are similar in length, a deeper analysis of their content shows
remarkable differences between the formal and the colloquial corpora. This section
analyses the use of user mentions and URLs whereas Section 4 analyses the use of hashtags.
Although dealing with user
mentions (@user) and links is not a big issue, they are discussed here to highlight how
they are used in Twitter. Table 2 gives the figures for the use of @user and URLs in the
body of the messages. @user mentions do not seem to follow any pattern. Their number
differs widely between the two bitexts of the TweetMT datasets: the EUES datasets contain
more than twice as many @user mentions as the CAES ones, and almost three times the
proportion of @user with respect to the number of tokens. Similarly, the TNORM dataset
shows a higher use of @user than the TSM one. It is worth noting that not all @user tokens
have a counterpart in the translated text, even though this token does not need to be
translated.
   In contrast, the use of URLs seems to be consistent within each of the two types of
datasets. The four datasets in the bitexts contain almost the same number of URLs, and we
can find almost one URL in each tweet. In turn, TNORM and TSM contain a remarkably small
number of URLs, fewer than 0.1 per tweet. As a curiosity, the majority of URLs in the
bitexts link to documents in the same language as the tweet. Given that the selected
authors post multilingual messages, it seems reasonable that they also link to the right
URL when available.

                           CAES.ca    CAES.es
  # @users                     743        873
  avg. @users/tweet           0.18       0.22
  % @users wrt. tokens       1.13%      1.32%
  # URLs                     3,511      3,525
  avg. URLs/tweet             0.88       0.88
  % URLs wrt. tokens         5.36%      5.33%
                           EUES.eu    EUES.es
  # @users                   1,947      2,070
  avg. @users/tweet           0.49       0.52
  % @users wrt. tokens       3.76%      3.55%
  # URLs                     3,461      3,458
  avg. URLs/tweet             0.86       0.86
  % URLs wrt. tokens         6.68%      5.92%
                             TNORM        TSM
  # @users                     665      3,439
  avg. @users/tweet           0.59       0.40
  % @users wrt. tokens       4.59%      2.78%
  # URLs                        69        743
  avg. URLs/tweet             0.06       0.09
  % URLs wrt. tokens         0.47%      0.60%

Table 2: Statistics on user mentions (@users) and URLs use in each corpus.

4    On the Importance of the Hashtag Occurrences
This section analyses the use of hashtags in the datasets. This study and the next one in
Section 5 follow the procedure in (Gotti et al., 2014), which proved very clear and
appropriate for this purpose. Table 3 shows some statistics on the occurrences of
hashtags. The difference in the number of hashtags between formal and colloquial datasets
is noticeable. The former contain about one hashtag per tweet or more (0.82 to 1.52 on
average), whereas the latter contain remarkably few (0.16 and 0.12).¹ This seems to
indicate that formal tweets tend to use hashtags to categorise their topic and, maybe,
create a trend. This is also reflected in Figure 1: most of the formal tweets, in the
bitexts, contain one or two hashtags, whereas most of the colloquial ones have none.

                              CAES.ca    CAES.es
  # hashtags                    3,286      3,821
  # hashtag types                 198        430
  avg. hashtags/tweet            0.82       0.96
  % hashtags wrt. tokens        5.01%      5.78%
  # tweets > 1 hashtag          1,520      1,750
                              EUES.eu    EUES.es
  # hashtags                    4,828      4,608
  # hashtag types                 584        438
  avg. hashtags/tweet            1.21       1.52
  % hashtags wrt. tokens        8.27%      8.90%
  # tweets > 1 hashtag          2,358      2,364
                                TNORM        TSM
  # hashtags                      182      1,046
  # hashtag types                 157          1
  avg. hashtags/tweet            0.16       0.12
  % hashtags wrt. tokens        1.26%      0.85%
  # tweets > 1 hashtag            103        744

Table 3: Statistics on hashtag use in each dataset.

¹ The number of hashtag types in TSM is 1 because the corpus contains only the #HASHTAG
label.

   A more interesting issue is the translation of hashtags. In terms of the number of
occurrences, each side of the bitexts contains a similar amount. However, the number of
hashtag types in CAES.ca is much lower than in CAES.es. A manual review of the hashtag
sets reveals that the Spanish versions contain more written variants than their
counterparts in Catalan. For instance, the hashtag “#revistapremsa” (Catalan) has four
variants in the Spanish text: “#revista”,
“#revistadeprensa”, “#revistaprensa”, and “#revistaprensa”.

[Figure 1: % tweets with exactly n hashtags, for n ∈ {0, 1, 2, 3}, shown for the TSM,
TNORM, EUES.es, EUES.eu, CAES.es and CAES.ca datasets.]

   According to (Gotti et al., 2014), hashtags can be classified by the role they play in
the text. They distinguish between hashtags that appear at the beginning of the text
(prologue), in the text (inline) and at the end of the text (epilogue). Correctly
identifying this role is important since a number of hashtags may have a syntactic
function inside the text (inline), or can help to identify the domain of the text
(prologue and epilogue). A simple heuristic was used to split the tweets into these three
parts, and the results shown are in line with the mentioned study. We can observe, in
Table 4, how the hashtag role within the text varies in each corpus. Although in different
proportions, the bulk of hashtags in the formal datasets appear in the epilogue, which
indicates there is a common practice of adding hashtags at the end of the tweet. In
contrast, the colloquial datasets have a very small proportion of tweets with either a
prologue or an epilogue, but a higher proportion of their hashtags appear in prologues (in
comparison to the formal tweets). This behaviour may simply indicate that colloquial
tweets do not necessarily follow any common practice. All datasets actually exhibit a low
rate of tweets having a prologue, although the EUES bitext shows a remarkably higher
number in comparison to the rest. Finally, it is worth noting that, although the number of
hashtags is lower in the colloquial texts, roughly half of them appear inline, and hence
they play a syntactic role in the message. This is important since they may contain an
essential part of the semantics and are thus worth dealing with. Unfortunately, hashtags
consist mainly of out-of-vocabulary words, as discussed next in Section 5.

                                CAES.ca    CAES.es
  % tweets with a prologue        2.85%      3.42%
  % tweets with an epilogue       43.6%     49.48%
  % of # in a prologue            3.50%      3.61%
  % of # in an epilogue          75.72%     73.46%
                                EUES.eu    EUES.es
  % tweets with a prologue       10.28%     10.90%
  % tweets with an epilogue      55.23%     55.13%
  % of # in a prologue            9.13%     10.63%
  % of # in an epilogue          57.27%     60.11%
                                  TNORM        TSM
  % tweets with a prologue        2.03%      2.39%
  % tweets with an epilogue       5.74%      3.83%
  % of # in a prologue           17.03%     20.08%
  % of # in an epilogue          40.66%     35.09%

Table 4: Statistics on hashtag (#) use as prologues and epilogues in each dataset.

5      On the OOV words in Twitter
The use of out-of-vocabulary (OOV) words in Twitter has been claimed to be a hard issue.
The reason is not only the high number of misspellings, symbols and orthographic errors,
which could be partially tackled by using spell-checkers, but also the use of specific
lexica and lexical variants. Examples are the use of word combinations (e.g., in
hashtags), the combination of different languages (especially in multilingual regions, but
also English terms) and the unlimited ability of the microblogging sphere to invent new
terms.
   This section gives a numerical analysis of OOVs that occur in Twitter. In order to
conduct this analysis, the datasets were processed to remove the user mentions and URLs,
since all of them are tokens that do not need to be translated. Some variants of the
datasets were built. First, only the CAES bitext was used due to the lack of a Language
Model (LM) for Basque. Then, since the TNORM annotations provide the corrected forms for
some OOV tokens (only spelling variants), they were used to build a new dataset, TNORM-S,
where OOVs were substituted with the correct word when available. In addition, two
different versions
were created out of each dataset. In the first one (clean data), the hashtags were kept
(the # symbol was removed) since they play an important role in the text, carry part of
the semantics of the message and need to be translated in most cases. In the second
dataset, all the hashtags were removed. The purpose of this second version is to highlight
the impact of hashtags on the perplexity estimation of the texts.
   Table 5 shows the results of this analysis. As expected, colloquial datasets contain a
higher number of OOVs. TNORM-S contains only a slightly lower number of them in comparison
to the non-normalised version, which indicates that the use of spell-checkers and the
substitution of lexical variants is not enough to deal with OOVs. This is reflected in the
figures on the perplexity of the datasets. The perplexity is high across all the datasets,
and it decreases only slightly after removing the hashtags from the data (except for TSM,
where it is essentially unchanged), indicating that the language used in the text is
notably different from the LM. This can be ascribed to the fact that the LM was built
using an out-of-domain corpus. In turn, removing the hashtags from the data decreases the
amount of OOVs, and seems to have an impact only in the formal datasets, where half of the
OOVs occur in the hashtags. However, their proportion is smaller when compared with the
colloquial datasets.

                          CAES.ca    CAES.es
  % OOV - clean data        5.61%      5.14%
  % OOV - no hashtags       2.81%      2.20%
  ppl - clean data            603        644
  ppl - no hashtags           520        543
                            TNORM    TNORM-S
  % OOV - clean data       14.23%     12.45%
  % OOV - no hashtags      13.53%     11.79%
  ppl - clean data          1,325      1,211
  ppl - no hashtags         1,300      1,192
                              TSM
  % OOV - clean data        9.18%
  % OOV - no hashtags       8.38%
  ppl - clean data          1,370
  ppl - no hashtags         1,373

Table 5: Percentage of OOVs and perplexity (ppl) estimation in each corpus using a LM
trained on the “El Periódico” corpus. (This parallel corpus is listed in the ELRA
catalog.)

   For the sake of comparison, the same calculation was carried out using a LM trained on
the TNORM corpus, the only publicly available corpus of the two colloquial ones. The new
LM was used to obtain the % of OOVs and perplexity estimations on the CAES.es and TSM
datasets. The results are shown in Table 6. The % of OOVs is higher in both cases, most
probably due to the small size of the corpus. However, the perplexity of the TSM dataset
has decreased. This seems to indicate that the LM was able to capture a high proportion of
the particular characteristics of colloquial tweets, and that these may be recurrent in
the colloquial genre and do not appear in formal texts.

                              TSM    CAES.es
  % OOV - clean data       11.08%     11.26%
  % OOV - no hashtags      10.30%      7.51%
  ppl - clean data            591        735
  ppl - no hashtags           591        669

Table 6: Percentage of OOVs and perplexity (ppl) in the TSM and CAES.es corpora using a LM
trained on the TNORM corpus.

6    Conclusions and Further Work
Twitter has its own particularities that make it a hard genre to deal with. This work
reviews recent publications that address the problem of Twitter translation. The number of
works in this field is still scarce due to the lack of corpora, but also because of the
lack of a gold standard and of specific evaluation methodologies that can help to assess
the quality of a tweet translation. This work also discusses the best-known issues in the
Twitter genre: the use of hashtags and the amount of OOVs, with special focus on comparing
formal and colloquial texts. The results obtained are preliminary, but they clearly show
that these two registers are different not only from a linguistic point of view, but also
in terms of tweet structure and content. Further work has to be done to align the hashtags
and the OOVs in bitext corpora and analyse the way they are translated. Also, the
annotation layers of the TSM corpus make it possible to refine the study, for instance, by
focusing on the differences between tweets with different communication functions. To
conclude, no major differences were found between languages, but this may be ascribed to
the fact that the datasets were obtained from bitext corpora.
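
As an illustration of the hashtag role split of Section 4 and of the two dataset variants of Section 5, a minimal Python sketch follows. The paper only states that a simple heuristic was used, so the rule below (leading run of hashtags as prologue, trailing run as epilogue, everything in between inline) is an assumption, and so are all function names, the toy tweet and the toy vocabulary. The OOV rate is computed against a plain word set rather than against a language model.

```python
import re

# Hypothetical hashtag pattern: '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def is_hashtag(token):
    """True when the whole token is a hashtag such as '#bcn'."""
    return HASHTAG.fullmatch(token) is not None


def hashtag_roles(tokens):
    """Label every hashtag as 'prologue', 'inline' or 'epilogue'.

    Assumed heuristic (not necessarily the one used in the paper): the
    prologue is the leading run of hashtags, the epilogue is the trailing
    run, and any hashtag in between is inline.
    """
    start = 0
    while start < len(tokens) and is_hashtag(tokens[start]):
        start += 1
    end = len(tokens)
    while end > start and is_hashtag(tokens[end - 1]):
        end -= 1
    roles = []
    for i, token in enumerate(tokens):
        if not is_hashtag(token):
            continue
        if i < start:
            roles.append("prologue")
        elif i >= end:
            roles.append("epilogue")
        else:
            roles.append("inline")
    return roles


def dataset_variants(tokens):
    """Build the 'clean data' and 'no hashtags' variants of one tweet."""
    clean = [t[1:] if is_hashtag(t) else t for t in tokens]  # drop only '#'
    no_hashtags = [t for t in tokens if not is_hashtag(t)]   # drop the hashtag
    return clean, no_hashtags


def oov_rate(tokens, vocabulary):
    """Fraction of tokens whose lowercased form is not in the vocabulary."""
    if not tokens:
        return 0.0
    return sum(t.lower() not in vocabulary for t in tokens) / len(tokens)


# Toy example: an invented tweet and an invented lexicon.
tweet = "#premsa avui parlem de #politica amb experts #bcn".split()
vocab = {"avui", "parlem", "de", "amb", "experts", "politica"}
clean, no_hashtags = dataset_variants(tweet)
print(hashtag_roles(tweet))
print(oov_rate(clean, vocab), oov_rate(no_hashtags, vocab))
```

Under this rule a tweet made only of hashtags counts them all as prologue; a different tie-break would shift the prologue and epilogue percentages, which is one reason such figures depend on the exact heuristic chosen.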
References                                         Solutions. In Proceedings of the Workshop
Alegria, Iñaki, Nora Aranberri, Cristina           on Language Analysis in Social Media,
   España-Bonet, Pablo Gamallo, Hugo G.            pages 80–89, Atlanta, Georgia, June.
   Oliveira, Eva Martínez, Iñaki San Vicente,      ACL.
   Antonio Toral, and Arkaitz Zubiaga.          Jehl, Laura. 2010. Machine Translation for
   2015. Overview of TweetMT: A Shared             Twitter. Master’s thesis, University of
   Task on Machine Translation of Tweets at        Edimburgh, United Kingdom.
   SEPLN 2015. In Proceedings of the Tweet
   Translation Workshop co-located with 31th    Jehl, Laura and Hieber, Felix and Riezler,
   Conference of the Spanish Society for           Stefan. 2012. Twitter Translation Using
   Natural Language Processing, Alacant,           Translation-based Cross-lingual Retrieval.
   Spain, September.                               In Proceedings of the Seventh Workshop
                                                   on Statistical Machine Translation, WMT
Alegria, Iñaki and Aranberri, Nora and            ’12, pages 410–421, Stroudsburg, PA,
   Comas, Pere R and Fresno, Vıctor and            USA. ACL.
   Gamallo, Pablo and Padró, Lluis and
   San Vicente, Iñaki and Turmo, Jordi and     Kaufmann, Max and Kalita, Jugal. 2010.
   Zubiaga, Arkaitz. 2014. TweetNorm es           Syntactic normalization of Twitter
   Corpus: an Annotated Corpus for Spanish        messages.     In Proceedings of the
   Microtext Normalization. In Proceedings        International Conference on Natural
   of the Ninth International Conference on       Language Processing, Kharagpur, India.
   Language Resources and Evaluation.           Ling, Wang and Marujo, Luis and
                                                   Dyer, Chris and Black, Alan W and
Bertoldi, Nicola and Cettolo, Mauro and
                                                   Trancoso, Isabel. 2014. Crowdsourcing
  Federico, Marcello. 2010. Statistical
                                                   High-Quality Parallel Data Extraction
  Machine Translation of Texts with
                                                   from Twitter.    In Proceedings of the
  Misspelled Words. In Proceedings of the
                                                   Ninth Workshop on Statistical Machine
  2010 Annual Conference of the North
                                                   Translation, pages 426–436, Baltimore,
  American Chapter of the ACL, pages
                                                   Maryland, USA, June. ACL.
  412–419. ACL.
                                                Roser Saurı́. 2013. Corpus de Dominio
Brendan O’Connor and Michel Krieger
                                                  Genérico y Especı́ficos (Inglés, Español,
  and David Ahn. 2010. TweetMotif:
                                                  Catalán y Portugués). Technical report,
  Exploratory   Search   and     Topic
                                                  Social Media. Métodos y Tecnologı́as para
  Summarization for Twitter.         In
                                                  los medios sociales. Programa CENIT
  Proceedings  of    the  International
                                                  2010 (CEN-20101037).
  Conference on Web and Social Media
  (ICWSM). The AAAI Press.                      S̆ubert, Eduard and Ondr̆ej Bojar. 2014.
Fabrizio Gotti and Phillippe Langlais              Twitter crowd translation – design and
  and Atefeh Farzindar. 2014. Hashtag              objectives.     In Translating and the
  Occurrences, Layout and Translation:             Computer 36, pages 217–227, Geneva,
  A Corpus-driven Analysis of Tweets               Switzerland. AsLing, The International
  Published by the Canadian Government.            Association for Advancement in Language
  In Proceedings of the Ninth International        Technology, Editions Tradulex; AsLing.
  Conference on Language Resources
  and Evaluation (LREC’14), Reykjavik,
  Iceland, may. ELRA.
Formiga, Lluı́s and Fonollosa, José A.
   R. 2012. Dealing with Input Noise
   in Statistical Machine Translation. In
   Proceedings of COLING 2012: Posters,
   pages 319–328, Mumbai, India, December.
Gotti, Fabrizio and Langlais, Philippe and
  Farzindar, Atefeh. 2013. Translating
  Government Agencies’ Tweet Feeds:
  Specificities, Problems and (a few)