                           Detecting Noisy Swiss German Web Text
                           Using RNN- and Rule-Based Techniques

                     Janis Goldzycher∗                                  Jonathan Schaber∗
          Institute of Computational Linguistics              Institute of Computational Linguistics
                    University of Zurich                                University of Zurich
            janis.goldzycher@uzh.ch                             jonathan.schaber@uzh.ch



∗ Equal Contribution.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                       Abstract

   This paper presents the system we submitted to the Swiss German language detection shared task, part of the GermEval 2020 Campaign, held at the SwissText & KONVENS 2020 conference. The goal of the task is to identify whether a given text snippet is written in Swiss German. Our approach includes a reformulation of the binary task as a multi-way classification problem, a character filter, a neural RNN-based classifier, and the addition of synthetic noise to the training set. The official evaluation of our submitted system results in an F1 score of 96.8%, achieving second place in this shared task.

1   Introduction

In this paper we describe our approach and results for the Swiss German language detection shared task (GSWID 2020) at the SwissText & KONVENS conference. The objective of the shared task is to construct a system that automatically identifies Swiss German (GSW) text snippets.
   Generally, language identification has been viewed as a solved problem "suitable for undergraduate instruction", as McNamee (2005) deprecatingly remarks in the title of his paper. However, it is not clear whether this view holds true for text snippets that are (1) short, (2) noisy, (3) from multiple domains, (4) written in a resource-scarce language, or (5) written in non-standardized dialects (Gamallo et al., 2014; Jauhiainen et al., 2019). Since this shared task is about GSW language identification and uses tweets as test data, it combines all of these difficulties.
   Previous approaches based on classical machine learning typically utilize character-level features such as single characters, character combinations (n-grams), and capitalization, together with models such as naive Bayes classifiers, support vector machines, and decision trees (Gamallo et al., 2014; Hanif et al., 2007; Kumar et al., 2015; Porta, 2014; Zubiaga et al., 2014). There have been both CNN-based (Jaech et al., 2016a,b; Li et al., 2018) and RNN-based (Jurgens et al., 2017; Kocmi and Bojar, 2017) neural approaches to language identification that use character embeddings as representations, sometimes with additional features incorporated, such as n-grams (Chang and Lin, 2014) or word embeddings (Samih et al., 2016). We approach this problem using a bidirectional GRU (BiGRU) architecture similar to the one put forward by Kocmi and Bojar (2017).
   In this paper we describe our system, comprising: (1) a reformulation of a binary into a multi-way classification problem, (2) a BiGRU-based neural architecture, (3) a character-based filter, and (4) a noisifier module.

2   Data

Provided Data   The shared task organizers provide a list of approximately 2,000 GSW tweets to be used as positive training examples.¹ The use of further training material is explicitly allowed and encouraged. In the following paragraphs we give an overview of the additionally collected data.

¹ Due to the distribution regulations of Twitter, the organizers published only tweet IDs. At the time of downloading, 22 of these tweets were not available anymore, so the actual number of tweets we are able to use is 1,978.
Swiss German   We collect GSW data from the following sources: the NOAH corpus (Hollenstein and Aepli, 2014), a collection of texts from various genres; the Swisscrawl corpus (Linder et al., 2019), which consists of user entries from forums and social media; the chatmania data from the SpinningBytes corpus (Grubenmann et al., 2018), containing forum entries; and the GSW corpus from the corpus collection of the University of Leipzig (Goldhahn et al., 2012), which also incorporates web data, mainly from chat forums.

Other Languages   There is of course an abundant amount of textual data in a multitude of other languages, which cannot all be considered or feasibly included in a training set. We devise the following difficulty scale from A (easy) to D (difficult) as a prioritization guideline for which languages we presume are hard to distinguish from GSW and thus most important to include in the training set as negative examples:

A: languages written in non-GSW character sets² (e.g. Chinese, Hindi, Arabic)
B: languages written in scripts that overlap with the GSW character set (e.g. Afrikaans, Tagalog, English, Tok Pisin)
C: languages in B that share parts of the lexicon with GSW (e.g. English, Italian, French, Standard German)
D: languages and varieties in C that are closely related to GSW (e.g. Standard German, Dutch, English, Bavarian)

   Note that the following set memberships hold: B ⊃ C ⊃ D and A ∩ B = ∅.
   We only collect languages from B, with special focus on C and D, since text snippets written in a language from A can be filtered out in a rule-based manner.

² We define the GSW character set as the set of characters found on a GSW keyboard. This differs slightly from e.g. a Standard German keyboard, which lacks characters like "è", "à" and "é".

   We collect data for all languages from the aforementioned corpus collection of the University of Leipzig. For Standard German, we additionally gather texts from the Hamburg Dependency Treebank (Foth et al., 2014). For all corpora that are not comprised of tweet-like text, we treat each sentence as an individual text snippet. An overview of our collected data is shown in table 1. We split our data set with a ratio of 0.95/0.05, resulting in a training set containing 3,605,283 instances and a development set of 189,752 instances.
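To make the split procedure concrete, the following minimal sketch shows one way such a 0.95/0.05 shuffle-and-split could be implemented; the function name, the fixed seed and the (text, label) pair format are our own assumptions rather than details taken from the paper.

import random

def split_dataset(snippets, dev_ratio=0.05, seed=42):
    # `snippets` is assumed to be a list of (text, language_label) pairs;
    # the seed is an arbitrary choice for reproducibility.
    rng = random.Random(seed)
    shuffled = list(snippets)
    rng.shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_ratio)
    return shuffled[n_dev:], shuffled[:n_dev]   # train split, development split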
   Language               # instances    relative share
   Swiss German (GSW)         780,502            19.17%
   Standard German            568,493            13.97%
   English                    304,822             7.49%
   Italian                    300,077             7.37%
   Dutch                      300,043             7.37%
   Swedish                    300,002             7.37%
   Luxembourgish              300,000             7.37%
   Norwegian                  300,000             7.37%
   French                     299,017             7.35%
   Low German                 100,000             2.46%
   West Frisian               100,000             2.46%
   Portuguese                 100,000             2.46%
   Romanian                   100,000             2.46%
   Tagalog                    100,000             2.46%
   Bavarian                    30,000             0.74%
   Lombard                     30,000             0.74%
   Yiddish                     30,000             0.74%
   Croatian                    10,001             0.25%
   Northern Frisian            10,000             0.25%
   Other                        8,060             0.20%
   Total                    4,071,017           100.00%

Table 1: Overview of collected text snippets per language. Languages with fewer than 1,000 examples, e.g. Turkish, are subsumed under the class other.

Noise   Through manual inspection of the tweets that were provided as training data we observed that they are significantly noisier than the rest of our training data.
   We identify two kinds of noise in this data: token-level noise and character-level noise. Both can be produced on purpose or by accident. Token-level noise consists of words, phrases or citations in other languages, mainly English or Standard German, in otherwise GSW tweets. Character-level noise consists of omissions, insertions or repetitions of single characters. Examples can be found in Appendix B.

3   Method

Task Formalization   We formalize the task as follows: assign a label y ∈ {0, 1} to an input sequence of characters x = {x_0, x_1, x_2, ..., x_n}, where 0 corresponds to the class swiss german and 1 corresponds to the class not-swiss german.
   However, the not-swiss german class is a very broad category, since it not only contains all other languages, some of which are similar to GSW, but also all possible character sequences that do not appear in GSW. Thus, we hypothesize that more fine-grained labels will lead to more homogeneous and better separable classes.
   Following this line of reasoning, we define three different granularity levels: binary, ternary and fine-grained. The binary setting corresponds to the task formalization described above. In the ternary setting we split the class not-swiss german into the classes standard german and other. In the fine-grained setting, each language present in table 1 corresponds to one class, with an additional class other. For our collected data set, this leads to a total of 23 classes.
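To make the three granularity levels concrete, the following sketch shows one possible mapping from a fine-grained language label to a class index at each level; the label strings and the helper itself are illustrative assumptions, not part of the paper's released code.

# Hypothetical mapping from a fine-grained language label to a class index
# at each granularity level (binary, ternary, fine-grained).

FINE_GRAINED_CLASSES = [
    "gsw", "deu", "eng", "ita", "nld", "swe", "ltz", "nor", "fra", "nds",
    "fry", "por", "ron", "tgl", "bar", "lmo", "yid", "hrv", "frr", "other",
    # ... remaining classes up to the 23 used in the paper
]

def to_class_index(language, granularity):
    if granularity == "binary":
        # 0 = swiss german, 1 = not-swiss german
        return 0 if language == "gsw" else 1
    if granularity == "ternary":
        # 0 = swiss german, 1 = standard german, 2 = other
        return {"gsw": 0, "deu": 1}.get(language, 2)
    if granularity == "fine-grained":
        # one class per language in table 1, plus a catch-all class "other"
        if language in FINE_GRAINED_CLASSES:
            return FINE_GRAINED_CLASSES.index(language)
        return FINE_GRAINED_CLASSES.index("other")
    raise ValueError("unknown granularity: " + granularity)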
Pipeline   We construct a pipeline in which an incoming text snippet is first cleaned of hashtags, mentions and URLs. Then a rule-based character filter decides whether the text snippet is a member of A and, if so, immediately classifies it as not-swiss german. If the text snippet is not part of A, it might be an instance of swiss german and hence is clipped to a prespecified length, which we treat as a hyperparameter, and fed into a neural classifier.
   During training, we make two modifications to the pipeline: (1) the rule-based character filter is left out, because our data only consists of text snippets from languages in B; (2) we make use of an additional noisifier, which adds noise specifically modeled after the noise that is actually encountered in GSW text snippets on the web. In the rest of this section we describe the main parts of the pipeline in detail.
Character-Based Filter   For a given sequence of characters x, the character-based filter computes the relative frequency of characters in x that do not appear in the GSW character set. If this frequency surpasses a given threshold, x is labeled as not-swiss german.
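The filter amounts to a single frequency test, sketched below; the default threshold of 0.8 is taken from the hyperparameter table in Appendix C, while the exact contents of the character set shown here are an approximation on our part (Latin letters, digits and punctuation plus a handful of GSW keyboard accents), not the authors' exact set.

import string

# Assumed approximation of the GSW character set (characters on a GSW keyboard);
# the real set used by the authors may differ in detail.
GSW_CHARSET = set(
    string.ascii_letters + string.digits + string.punctuation + " "
    + "äöüàéèêâôûç"
)

def is_non_gsw(snippet, threshold=0.8):
    # True if the relative frequency of non-GSW characters exceeds the threshold
    if not snippet:
        return False
    non_gsw = sum(1 for ch in snippet if ch not in GSW_CHARSET)
    return non_gsw / len(snippet) > threshold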
[Figure 1: Neural architecture based on character embeddings, a BiGRU, two dense blocks and a final dense layer.]

Neural Model   Our neural model comprises character embeddings, a BiGRU (Cho et al., 2014), two blocks of dense layers and a final dense layer (figure 1). The BiGRU takes the embedded characters as input and produces the outputs $\overrightarrow{o_0}, \dots, \overrightarrow{o_n}$ and a last hidden state $\overrightarrow{h_n}$ for the forward GRU. For the backward GRU we get $\overleftarrow{o_n}, \dots, \overleftarrow{o_0}$ and $\overleftarrow{h_0}$, respectively.
   We ignore all BiGRU outputs and only use the last hidden states $\overrightarrow{h_n}$ and $\overleftarrow{h_0}$.³
   Each hidden state is fed into a block of two dense layers with dropout before both layers and the rectified linear unit (ReLU) activation in between. The outputs of the two dense blocks, z1 and z2, are concatenated and fed into a final dense layer whose output dimension is the number of classes. We apply a log-softmax function to the output to turn the neural activations into a probability distribution over the target classes. Note that the number of target classes depends on the chosen level of granularity.
   For optimization we use the negative log-likelihood loss combined with the Adam optimizer (Kingma and Ba, 2014). We initialize the character embeddings randomly and train them jointly with the rest of the model.

³ In earlier experiments we also used the BiGRU outputs by concatenating them with the last hidden states and feeding this entire feature vector into the dense layers. However, we found that using these outputs decreased performance.

Noisifier   Based on the assumption that the test data has a similar amount of noise as the tweets provided for training, we introduce a noisifier with the goal of injecting this type of noise into the entire training data, which contains large amounts of text snippets from "clean" resources like news texts. We refer to this difference in noise between corpora as the noisiness gap. Recall that we observed token-level and character-level noise in the training data in section 2. In what follows, we address both types of noise separately.
   For the token-level noise we created a handcrafted list L consisting of English and Standard German words often found in GSW tweets, comments and messages. Additionally, we add mentions of Swiss locations to L.⁴

⁴ We try to avoid that the model learns to associate Swiss location names with GSW text, which would presumably lead to false positives.
    Configuration & Training                      Development Set                               Test Set
  Emb Clip G N             TT    Prec      Rec      Acc    F1       AccT    AccF    Prec      Rec     Acc      F1
  100    280     b   f    15.7   0.946    0.926    0.976 0.936          -       -   0.898    0.920 0.911      0.909
  100    280     t   f    15.9   0.971    0.957    0.986 0.964      0.980       -   0.905    0.942 0.924      0.923
  100    280     f   f    15.8   0.994    0.992    0.997 0.993      0.989   0.994   0.907    0.990 0.946      0.947
  100    100     b   f     6.8   0.987    0.980    0.994 0.983          -       -   0.872    0.880 0.880      0.876
  100    100     t   f     6.6   0.982    0.973    0.992 0.978      0.988       -   0.932    0.954 0.944      0.943
  100    100     f   f     6.5   0.991    0.989    0.996 0.990      0.988   0.991   0.930    0.984 0.957      0.956
  300    100     b   f     7.0   0.987    0.977    0.993 0.981          -       -   0.949    0.931 0.943      0.940
  300    100     t   f     7.1   0.992    0.988    0.996 0.990      0.995       -   0.959    0.948 0.955      0.953
  300    100     f   f     7.1   0.992    0.989    0.996 0.990      0.988   0.992   0.927    0.985 0.956      0.955
  300    100     b   t     8.4   0.993    0.987    0.996 0.991          -       -   0.955    0.980 0.968      0.967
  300    100     t   t     7.8   0.993    0.986    0.996 0.990      0.995       -   0.947    0.983 0.965      0.965
  300    100     f   t     7.8   0.994    0.987    0.997 0.991      0.988   0.992   0.945    0.993 0.969      0.968

Table 2: Results on the development and test set. Abbreviations: Emb = embedding dimension, Clip = clipped after m characters, G = granularity (binary, ternary, fine-grained), N = noise injected (true, false), TT = training time in hours. The last row shows the configuration that we submitted to the shared task.

   The token-level noisifier receives as input a clean training example x consisting of k tokens and the two thresholds p1 ∈ [0, 1] and p2 ∈ [0, 1) with p1 > p2. For each token in x, a noise token l ∈ L is inserted with a probability of 1 − p1. We hypothesize that the presence of one noise token increases the probability of additional noise tokens. To model this, we use a higher second probability 1 − p2 for repeatedly adding an additional noise token. We define an upper bound of k/2 for the number of inserted noise tokens c under the assumption that a text snippet with c ≥ k/2 does not resemble the original language of x anymore. See algorithm 1 for more details.
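For illustration, a minimal Python sketch of this token-level injection scheme is given below; it follows the description above and algorithm 1, with the default thresholds p1 = 0.99 and p2 = 0.6 from Appendix C, but the function and variable names are our own and the list of noise tokens is a toy placeholder for L.

import random

# Toy stand-in for the handcrafted list L of English / Standard German words
# and Swiss location mentions described above.
NOISE_TOKENS = ["trying", "to", "stay", "chill", "music", "oder", "aber"]

def add_token_noise(text, p1=0.99, p2=0.6):
    # Insert noise tokens from L in front of tokens of `text` (cf. algorithm 1).
    tokens = text.split()
    k = len(tokens)
    out, inserted = [], 0
    for token in tokens:
        if random.random() > p1:
            out.append(random.choice(NOISE_TOKENS))
            inserted += 1
            # one noise token makes further ones more likely (1 - p2 > 1 - p1),
            # bounded by at most k/2 inserted noise tokens per snippet
            while random.random() > p2 and inserted < k / 2:
                out.append(random.choice(NOISE_TOKENS))
                inserted += 1
        out.append(token)
    return " ".join(out)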
   The algorithm inserting character-level noise receives as input a token-level noisified training example xTnoise and, analogous to token-level noise injection, the two thresholds p3 ∈ [0, 1] and p4 ∈ [0, 1) with p3 > p4. Additionally, the algorithm receives a character set C, consisting of alphanumeric and punctuation characters from the Latin-1 character set. At each character in xTnoise, character-level noise is injected with a probability of 1 − p3. The noise consists of either character insertion, omission, or repetition; all three types of noise are equally likely to happen. We hypothesize that the presence of character-level noise makes more such noise likelier. Thus, in case of insertion or repetition, we repeatedly add additional noise characters with a probability of 1 − p4. See algorithm 2 for more details.
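Analogously, a sketch of the character-level injection is shown below, again with our own naming and the default thresholds p3 = 0.97 and p4 = 0.5 from Appendix C; the character pool is restricted here to a small printable subset rather than the full Latin-1 alphanumeric and punctuation set used by the authors.

import random
import string

# Simplified stand-in for the character set C (Latin-1 alphanumerics and punctuation).
CHAR_POOL = string.ascii_letters + string.digits + string.punctuation

def add_char_noise(text, p3=0.97, p4=0.5):
    # Randomly omit, insert or repeat single characters (cf. algorithm 2).
    out = []
    for ch in text:
        if random.random() > p3:
            action = random.choice(["omission", "insertion", "repetition"])
            if action == "omission":
                continue  # drop the original character
            noise_char = random.choice(CHAR_POOL) if action == "insertion" else ch
            out.append(noise_char)
            # further noise characters become more likely once noise has started
            while random.random() > p4:
                out.append(noise_char)
        out.append(ch)
    return "".join(out)

Chaining the two functions, e.g. add_char_noise(add_token_noise(snippet)), on a clean snippet yields outputs of the kind shown in Appendix B.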
   Our implementation of the approach described in this section, using PyTorch, will be published at https://github.com/JonathanSchaber/shared task.

4   Results and Discussion

Our submitted model achieves an F1 score of 96.8% in the official evaluation on the test set, resulting in second place, 1.4% behind the best model.
   Table 2 gives an overview of different hyperparameter settings with the corresponding results on the development and test set.⁵ We report the following observations: (1) More fine-grained classes generally lead to better results. (2) There is a strong performance drop from the development to the test set, supporting our noisiness-gap assumption. (3) Injecting noise alleviates this drop and, compared to the same configurations without noise, leads to relative performance increases ranging from 1.2% to 2.7% F1 score on the test set. (4) Increasing the embedding dimensionality leads to more stable results over different granularities. (5) Clipping after 100 characters halves the training time while on average upholding performance.
   Since the test set does not contain languages from A, the character-based filter is rarely triggered and its impact on performance is negligible. However, the filter might be important when detecting GSW text in settings where languages from A occur more frequently. More information about hyperparameters and hardware is given in Appendix C.

⁵ The test set evaluation relies on gold labels that were made available after the submission deadline.

5   Conclusion

This paper described our submission to the GSWID 2020 shared task. We introduced a BiGRU-based architecture, a character-based filter and a noisifier module.
Our evaluation results show that more fine-grained classes and adding noise to the training data lead to performance increases. Further investigations will concern pre-training, transformer-based architectures, and a more sophisticated noisifier.

Acknowledgments   Above all, we would like to thank Simon Clematide, who supervised this project and suggested more fine-grained classes. Further, we thank the shared task organizers, especially Pius von Däniken, who clarified our questions, and also our proofreaders and reviewers.

References

Joseph Chee Chang and Chu-Cheng Lin. 2014. Recurrent-neural-network for language detection on Twitter code-switching corpus. arXiv preprint arXiv:1412.4314.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Kilian Foth, Arne Köhn, Niels Beuck, and Wolfgang Menzel. 2014. Because size does matter: The Hamburg Dependency Treebank.

Pablo Gamallo, Marcos Garcia, Susana Sotelo, and José Ramom Pichel Campos. 2014. Comparing ranking-based and naive Bayes approaches to language detection on tweets. In TweetLID@SEPLN, pages 12–16.

Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. 2012. Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages. In LREC, volume 29, pages 31–43.

Ralf Grubenmann, Don Tuggener, Pius Von Däniken, Jan Deriu, and Mark Cieliebak. 2018. SB-CH: A Swiss German corpus with sentiment annotations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Farheen Hanif, Fouzia Latif, and M Sikandar Hayat Khiyal. 2007. Unicode aided language identification across multiple scripts and heterogeneous data. Information Technology Journal, 6(4):534–540.

Nora Hollenstein and Noëmi Aepli. 2014. Compilation of a Swiss German dialect corpus and its application to PoS tagging. In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, pages 85–94.

Aaron Jaech, George Mulcaire, Shobhit Hathi, Mari Ostendorf, and Noah A Smith. 2016a. Hierarchical character-word models for language identification. arXiv preprint arXiv:1608.03030.

Aaron Jaech, George Mulcaire, Mari Ostendorf, and Noah A Smith. 2016b. A neural model for language identification in code-switched tweets. In Proceedings of the Second Workshop on Computational Approaches to Code Switching, pages 60–64.

Tommi Sakari Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, and Krister Lindén. 2019. Automatic language identification in texts: A survey. Journal of Artificial Intelligence Research, 65:675–782.

David Jurgens, Yulia Tsvetkov, and Dan Jurafsky. 2017. Incorporating dialectal variability for socially equitable language identification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 51–57.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Tom Kocmi and Ondřej Bojar. 2017. LanideNN: Multilingual language identification on character window. arXiv preprint arXiv:1701.03338.

Rahul Venkatesh Kumar, M Anand Kumar, and KP Soman. 2015. AmritaCEN NLP@FIRE 2015 language identification for Indian languages in social media text. In FIRE Workshops, pages 26–28.

Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018. What's in a domain? Learning domain-robust text representations using adversarial training. arXiv preprint arXiv:1805.06088.

Lucy Linder, Michael Jungo, Jean Hennebert, Claudiu Musat, and Andreas Fischer. 2019. Automatic creation of text corpora for low-resource languages from the internet: The case of Swiss German. arXiv preprint arXiv:1912.00159.

Paul McNamee. 2005. Language identification: a solved problem suitable for undergraduate instruction. Journal of Computing Sciences in Colleges, 20(3):94–101.

Jordi Porta. 2014. Twitter language identification using rational kernels and its potential application to sociolinguistics. In TweetLID@SEPLN, pages 17–20.

Younes Samih, Suraj Maharjan, Mohammed Attia, Laura Kallmeyer, and Thamar Solorio. 2016. Multilingual code-switching identification via LSTM recurrent neural networks. In Proceedings of the Second Workshop on Computational Approaches to Code Switching, pages 50–59.

Arkaitz Zubiaga, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel Campos, Iñaki Alegría Loinaz, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno-Fernández. 2014. Overview of TweetLID: Tweet language identification at SEPLN 2014. In TweetLID@SEPLN, pages 1–11.
A   Neural Architecture

We formally define our architecture as follows. Let E(x_i) denote a function that returns the embedding for a given character x_i ∈ x. The last hidden states $\overrightarrow{h_n}$ and $\overleftarrow{h_0}$ are given by

    $\overrightarrow{h_n}, \overleftarrow{h_0}, \overrightarrow{o_0}, \dots, \overrightarrow{o_n}, \overleftarrow{o_n}, \dots, \overleftarrow{o_0} = \mathrm{BiGRU}(E(x_0), \dots, E(x_n))$.   (1)

We feed each last hidden state into a block of dense layers defined as

    $f_{block}(v) = W_2^T \cdot \mathrm{dr}(\mathrm{ReLU}(W_1^T \cdot \mathrm{dr}(v) + b_1)) + b_2$   (2)

where $W_1 \in \mathbb{R}^{300 \times 150}$ and $W_2 \in \mathbb{R}^{150 \times 50}$ denote the weight matrices of the block's first and second layer, dr denotes a dropout function, $b_1$ and $b_2$ denote learnable biases, and ReLU denotes the rectified linear unit activation function.
   $z_1$ is computed as $z_1 = f_{block}(\overleftarrow{h_0})$ and $z_2$ as $z_2 = f_{block}(\overrightarrow{h_n})$. Note that the weights of the two dense blocks are not shared, but initialized and trained independently. We concatenate $z_1$ and $z_2$ to $z$, which is then fed through a final layer formalized as

    $f_{final}(v) = \text{log-softmax}(W^T \cdot v + b)$   (3)

with $W \in \mathbb{R}^{100 \times Q}$, where Q is the number of target classes.
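For concreteness, a PyTorch sketch of this architecture is given below. It follows equations (1)-(3) and the hyperparameters from Table 3 (hidden size 300, dense block sizes 300 -> 150 -> 50, dropout 0.1), but it is our own reconstruction, not the authors' released code, and the class and argument names are assumptions.

import torch
import torch.nn as nn

class GswClassifier(nn.Module):
    """BiGRU classifier over character embeddings (cf. equations 1-3)."""

    def __init__(self, vocab_size, num_classes,
                 emb_dim=300, hidden_size=300, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim, hidden_size, batch_first=True, bidirectional=True)
        # one dense block per direction: 300 -> 150 -> 50, dropout before each layer
        self.block_fwd = self._dense_block(hidden_size, dropout)
        self.block_bwd = self._dense_block(hidden_size, dropout)
        self.final = nn.Linear(2 * 50, num_classes)

    @staticmethod
    def _dense_block(hidden_size, dropout):
        return nn.Sequential(
            nn.Dropout(dropout), nn.Linear(hidden_size, 150),
            nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(150, 50),
        )

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer character indices
        embedded = self.embedding(char_ids)
        _, h_n = self.bigru(embedded)       # h_n: (2, batch, hidden_size)
        z1 = self.block_bwd(h_n[1])         # last hidden state of the backward GRU
        z2 = self.block_fwd(h_n[0])         # last hidden state of the forward GRU
        logits = self.final(torch.cat([z1, z2], dim=-1))
        return torch.log_softmax(logits, dim=-1)

Training would then combine the log-softmax output with a negative log-likelihood loss and the Adam optimizer, as described in section 3.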

B      Noisifier                                                                     clean: Viele Personen sind nicht der Überzeugung.
                                                                                     noisy: Viele Personen sind nicht der Üerzeugunnng.
As examples for data containing token- and
character-level noise, consider the following two                                    clean: Hast du schon die neue xbox 3 gesehen?
made up text snippets.6 Character sequences we                                       noisy: Hast du music schon die neue xbox 3 geese-
                                                                                          hen?
regard as noise are boldfaced.
                                                                                     clean: You’ll never guess what happened this morn-
        Dä bus isch stablibe, mis ticket nüme gültig try-                              ing.
        ing to stay chill                                                            noisy: You’ll never guess Jwhat happened this
                                                                                          morninnng.
        ooohhhh neiiiii mir händs nöd gschafft
                                                                                     clean: Le tigre est un grand chat de proie originaire
   In the following algorithms, r() denotes a func-                                       d’Asie.
tion which returns a random value ∈ [0, 1].                                          noisy: Le tigre estt un grand chatde proie originaire
                                                                                          d’Asie.
   In algorithm 2 parameter A contains
{’omission’, ’insertion’, ’repetition’}.                                             clean: C’è ancora una mancanza di chiarezza, non
                                                                                          possiamo farci nulla.
   For a given example input our noisifier with pa-
                                                                                     noisy: C’è ancor una mancanza di chiarezza, non pos-
rameters settings as shown in the hyper parame-                                           siamo St. Moritz Frisör farci nulla.
ter table in Appendix C introduces noise structures
into non-noisy texts, like the following:                                         As is obvious from these examples, the noise
                                                                               injected by the noisifier still looks quite different
   6
     For copyright reasons we do not cite or display real                      from human created noise, thus a more sophisti-
tweets in the publication.                                                     cated noisifier is desirable.
C   Configurations

Table 2 only lists the parameters that were varied during ablation testing. In the table below, we report the parameters that we left unchanged across all experiments.
    Parameter                                  Value
    hidden size h                                300
    dense-block layer-in size                    300
    dense-block layer-betw. size                 150
    dense-block layer-out size                     50
    final-block layer-in size                    100
    final-block layer-out size     # target classes
    dropout                                       0.1
    learning rate                              0.001
    number of epochs                               15
    p1                                          0.99
    p2                                            0.6
    p3                                          0.97
    p4                                            0.5
    character-filter threshold                    0.8

Table 3: Hyperparameters maintained constant during
all experiments.

   After two epochs the learning rate is decreased
from 0.001 to 0.0001 and after six epochs the
learning rate is further decreased to 0.00003.
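   As a sketch of this schedule (our own reconstruction, assuming a PyTorch Adam optimizer with the base learning rate of 0.001 from Table 3 and a placeholder model), a step-based multiplier reproduces the three learning-rate phases:

import torch

def lr_lambda(epoch):
    # multiplier on the base learning rate of 0.001 (see Table 3)
    if epoch < 2:
        return 1.0    # epochs 0-1: 0.001
    if epoch < 6:
        return 0.1    # epochs 2-5: 0.0001
    return 0.03       # epoch 6 onwards: 0.00003

model = torch.nn.Linear(10, 2)  # placeholder model for a self-contained example
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)

for epoch in range(15):
    # ... one training epoch ...
    scheduler.step()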
   We ran our models on an NVIDIA GeForce GTX TITAN X graphics processing unit.