=Paper=
{{Paper
|id=Vol-2624/germeval-task2-paper3
|storemode=property
|title=Detecting Noisy Swiss German Web Text Using RNN- and Rule-Based Techniques
|pdfUrl=https://ceur-ws.org/Vol-2624/germeval-task2-paper3.pdf
|volume=Vol-2624
|authors=Janis Goldzycher,Jonathan Schaber
|dblpUrl=https://dblp.org/rec/conf/swisstext/GoldzycherS20
}}
==Detecting Noisy Swiss German Web Text Using RNN- and Rule-Based Techniques==
Janis Goldzycher* and Jonathan Schaber*
Institute of Computational Linguistics, University of Zurich
janis.goldzycher@uzh.ch, jonathan.schaber@uzh.ch

* Equal contribution. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
This paper presents the system we submitted to the Swiss German language detection shared task, part of the GermEval 2020 Campaign, held at the SwissText & KONVENS 2020 conference. The goal of the task is to identify whether a given text snippet is written in Swiss German. Our approach includes a reformulation of a binary to a multi-way classification problem, a character filter, a neural RNN-based classifier, and the addition of synthetic noise to the training set. The official evaluation of our submitted system results in an F1 score of 96.8%, achieving the second place in this shared task.

===1 Introduction===
In this paper we describe our approach and results for the Swiss German language detection shared task (GSWID 2020) at the SwissText & KONVENS conference. The objective of the shared task is to construct a system that automatically identifies Swiss German (GSW) text snippets.

Generally, language identification has been viewed as a solved problem "suitable for undergraduate instruction", as McNamee (2005) deprecatingly remarks in the title of his paper. However, it is not clear whether this view holds true for text snippets that are (1) short, (2) noisy, (3) from multiple domains, (4) written in a scarce-resource language, or (5) consist of non-standardized dialects (Gamallo et al., 2014; Jauhiainen et al., 2019). Since this shared task is about GSW language identification and uses tweets as test data, it combines all of these difficulties.

Previous approaches based on classical machine learning typically utilize character-level features such as single characters, character combinations (n-grams) and capitalization, together with models such as naive Bayes classifiers, support vector machines, and decision trees (Gamallo et al., 2014; Hanif et al., 2007; Kumar et al., 2015; Porta, 2014; Zubiaga et al., 2014). There have been both CNN-based (Jaech et al., 2016a,b; Li et al., 2018) and RNN-based (Jurgens et al., 2017; Kocmi and Bojar, 2017) neural approaches to language identification using character embeddings as representations, sometimes with additional features incorporated, like n-grams (Chang and Lin, 2014) or word embeddings (Samih et al., 2016). We approach this problem using a bidirectional GRU (BiGRU) architecture similar to the one put forward by Kocmi and Bojar (2017).

In this paper we describe our system, comprising: (1) a reformulation of a binary to a multi-way classification problem, (2) a BiGRU-based neural architecture, (3) a character-based filter, and (4) a noisifier module.

===2 Data===
Provided Data: The shared task organizers provide a list of approximately 2,000 GSW tweets to be used as positive training examples.[1] The use of further training material is explicitly allowed and encouraged. In the following paragraphs we give a review of additionally collected data.

[1] Due to the distribution regulations of Twitter, the organizers published only tweet IDs. At the time of downloading, 22 of these tweets were not available anymore, so the actual number of tweets we are able to use is 1,978.

Swiss German: We collect GSW data from the following sources: the NOAH corpus (Hollenstein and Aepli, 2014), a collection of texts from various genres; the Swisscrawl corpus (Linder et al., 2019), which consists of user entries from forums and social media; the chatmania data from the SpinningBytes corpus (Grubenmann et al., 2018), containing forum entries; and the GSW corpus from the corpus collection of the University of Leipzig (Goldhahn et al., 2012), which also incorporates web data, mainly from chat forums.
Other Languages: There is of course an abundant amount of textual data in a multitude of other languages, which cannot be entirely considered or feasibly included in a training set. We devise the following difficulty scale from A (easy) to D (difficult) as a prioritization guideline for which languages we presume are hard to distinguish from GSW and thus most important to include in the training set as negative examples:

A: languages written in non-GSW character sets[2] (e.g. Chinese, Hindi, Arabic)
B: languages written in scripts that overlap with the GSW character set (e.g. Afrikaans, Tagalog, English, Tok Pisin)
C: languages in B that share parts of the lexicon with GSW (e.g. English, Italian, French, Standard German)
D: languages and varieties in C that are closely related to GSW (e.g. Standard German, Dutch, English, Bavarian)

Note that the following set memberships hold: B ⊃ C ⊃ D and A ∩ B = ∅.

[2] We define the GSW character set as the set of characters found on a GSW keyboard. This differs slightly from e.g. a Standard German keyboard, which lacks characters like "è", "à" and "é".

We only collect languages from B, with special focus on C and D, since text snippets written in a language from A can be filtered out in a rule-based manner. We collect data for all languages from the aforementioned corpus collection of the University of Leipzig. For Standard German, we additionally gather texts from the Hamburg Dependency Treebank (Foth et al., 2014). For all corpora that are not comprised of tweet-like text, we treat each sentence as an individual text snippet. An overview of our collected data is shown in table 1. We split our data set with a ratio of 0.95/0.05, resulting in a training set containing 3,605,283 instances and a development set of 189,752 instances.

Table 1: Overview of collected text snippets per language. Languages with fewer than 1,000 examples, e.g. Turkish, are subsumed under the class other.

Language | # instances | relative
Swiss German (GSW) | 780,502 | 19.17%
Standard German | 568,493 | 13.97%
English | 304,822 | 7.49%
Italian | 300,077 | 7.37%
Dutch | 300,043 | 7.37%
Swedish | 300,002 | 7.37%
Luxembourgish | 300,000 | 7.37%
Norwegian | 300,000 | 7.37%
French | 299,017 | 7.35%
Low German | 100,000 | 2.46%
West Frisian | 100,000 | 2.46%
Portuguese | 100,000 | 2.46%
Romanian | 100,000 | 2.46%
Tagalog | 100,000 | 2.46%
Bavarian | 30,000 | 0.74%
Lombard | 30,000 | 0.74%
Yiddish | 30,000 | 0.74%
Croatian | 10,001 | 0.25%
Northern Frisian | 10,000 | 0.25%
Other | 8,060 | 0.20%
Total | 4,071,017 | 100.00%

Noise: Through manual inspection of the tweets that were provided as training data, we observed that they are significantly noisier than the rest of our training data. We identify two kinds of noise in this data: token-level noise and character-level noise. Both can be produced on purpose or by accident. Token-level noise consists of words, phrases or citations in other languages, mainly English or Standard German, in otherwise GSW tweets. Character-level noise consists of omissions, insertions or repetitions of single characters. Examples can be found in Appendix B.

===3 Method===
Task Formalization: We formalize the task as follows: assign a label y ∈ {0, 1} to an input sequence of characters x = {x_0, x_1, x_2, ..., x_n}, where 0 corresponds to the class swiss german and 1 corresponds to the class not-swiss german. However, the not-swiss german class is a very broad category, since it not only contains all other languages, some of which are similar to GSW, but also all possible string sequences that do not appear in GSW. Thus, we hypothesize that more fine-grained labels will lead to more homogeneous and better separable classes.

Following this line of reasoning, we define three granularity levels: binary, ternary and fine-grained. The binary setting corresponds to the task formalization described above. In the ternary setting we split the class not-swiss german into the classes standard german and other. In the fine-grained setting, each language present in table 1 corresponds to one class, with an additional class other. For our collected data set, this leads to a total of 23 classes.
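Since the shared task itself requires a binary decision, predictions made at ternary or fine-grained granularity have to be collapsed back to swiss german versus not-swiss german. The paper does not spell out this mapping, so the following minimal Python sketch only illustrates the assumed collapse; the class name is a placeholder, not taken from the authors' code.

```python
# Hypothetical post-processing step: ternary or fine-grained predictions are
# collapsed back to the binary shared-task labels (0 = swiss german, 1 = not-swiss german).

GSW_CLASS = "swiss_german"  # illustrative class name, not from the authors' code

def collapse_to_binary(predicted_class: str) -> int:
    """Map any non-GSW prediction to the not-swiss german label."""
    return 0 if predicted_class == GSW_CLASS else 1

# Example: a fine-grained model predicting "standard_german" yields the binary label 1.
print(collapse_to_binary("standard_german"))  # -> 1
print(collapse_to_binary(GSW_CLASS))          # -> 0
```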
Pipeline: We construct a pipeline where an incoming text snippet is first cleaned of hashtags, mentions and URLs. Then a rule-based character filter decides whether the text snippet is a member of A and, if so, immediately classifies it as not-swiss german. If the text snippet is not part of A, it might be an instance of swiss german and hence is clipped to a prespecified length, which we treat as a hyperparameter, and fed into a neural classifier.

During training time, we make two modifications to the pipeline: (1) the rule-based character filter is left out, because our data only consists of text snippets from languages in B; (2) we make use of an additional noisifier, which adds noise specifically modeled after the noise that is actually encountered in GSW text snippets on the web. In the rest of this section we describe the main parts of the pipeline in detail.

Character-Based Filter: For a given sequence of characters x, the character-based filter computes the relative frequency of characters in x that do not appear in the GSW character set. If this frequency surpasses a given threshold, x is labeled as not-swiss german.
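For illustration, a minimal Python sketch of the preprocessing and the character-based filter could look as follows. The GSW alphabet and the cleaning rules below are assumptions (the paper defines the character set only as the characters found on a GSW keyboard); the threshold of 0.8 is the value reported in Table 3 (Appendix C).

```python
import re

# Assumed approximation of the GSW character set (not listed exhaustively in the paper).
GSW_CHARSET = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "äöüÄÖÜàâéèêëîïôçÀÉÈ"
    "0123456789"
    " .,;:!?'\"-()[]/&@#+*%"
)

def clean_snippet(text: str) -> str:
    """Remove hashtags, mentions and URLs, as done at the start of the pipeline."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)                 # mentions and hashtags
    return re.sub(r"\s+", " ", text).strip()

def is_non_gsw_script(text: str, threshold: float = 0.8) -> bool:
    """Character-based filter: label as not-swiss german if the relative frequency
    of characters outside the GSW character set exceeds the threshold."""
    if not text:
        return False
    non_gsw = sum(1 for ch in text if ch not in GSW_CHARSET)
    return non_gsw / len(text) > threshold

print(is_non_gsw_script(clean_snippet("这是一条中文推文")))            # True: filtered out
print(is_non_gsw_script(clean_snippet("Grüezi mitenand! #zürich")))  # False: passed on
```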
Neural Model: Our neural model comprises character embeddings, a BiGRU (Cho et al., 2014), two blocks of dense layers and a final dense layer. The BiGRU takes the embedded characters as input and produces the outputs →o_0, ..., →o_n as well as a last hidden state →h_n for the forward GRU. For the backward GRU we get ←o_n, ..., ←o_0 and ←h_0, respectively. We ignore all BiGRU outputs and only use the last hidden states →h_n and ←h_0.[3] Each hidden state is fed into a block of two dense layers with dropout before both layers and the rectified linear unit function in between. The outputs of the two dense blocks, z_1 and z_2, are concatenated and fed into a final dense layer with the number of classes as the output dimension. We apply a log-softmax function to the output to turn the neural activations into a probability distribution over the target classes. Note that the number of target classes depends on the chosen level of granularity.

Figure 1: Neural architecture based on character embeddings, a BiGRU, two dense blocks and a final dense layer.

[3] In earlier experiments we also used the BiGRU outputs by concatenating them with the last hidden states and feeding this entire feature vector into the dense layers. However, we found that using these outputs decreased performance.

For optimization we use the negative log likelihood loss combined with the Adam optimizer (Kingma and Ba, 2014). We initialize the character embeddings randomly and train them jointly with the rest of the model.

Noisifier: Based on the assumption that the test data has a similar amount of noise as the tweets provided for training, we introduce a noisifier with the goal of injecting this type of noise into the entire training data, which contains large amounts of text snippets from "clean" resources like news texts. We refer to this difference in noise between corpora as the noisiness gap. Recall that we observed token-level and character-level noise in the training data in section 2. In what follows, we address both types of noise separately.

For the token-level noise we created a handcrafted list L consisting of English and Standard German words often found in GSW tweets, comments and messages. Additionally, we add mentions of Swiss locations to L.[4]

[4] We try to avoid that the model learns to associate Swiss location names with GSW text, which presumably would lead to false positives.

The token-level noisifier receives as input a clean training example x consisting of k tokens and the two thresholds p_1 ∈ [0, 1] and p_2 ∈ [0, 1) with p_1 > p_2. For each token in x, a noise token l ∈ L is inserted with a probability of 1 − p_1. We hypothesize that the presence of one noise token increases the probability of additional noise tokens. To model this, we use a higher second probability 1 − p_2 for repeatedly adding an additional noise token. We define an upper bound of k/2 for the number of inserted noise tokens c, under the assumption that a text snippet with c ≥ k/2 does not resemble the original language of x anymore. See algorithm 1 for more details.

The algorithm inserting character-level noise receives as input a token-level noisified training example x_Tnoise and, analogous to token-level noise injection, the two thresholds p_3 ∈ [0, 1] and p_4 ∈ [0, 1) with p_3 > p_4. Additionally, the algorithm receives a character set C, consisting of alphanumeric and punctuation characters from the Latin-1 character set. At each character in x_Tnoise, character-level noise is injected with a probability of 1 − p_3. The noise consists of either character insertion, omission, or repetition; all three types of noise are equally likely. We hypothesize that the presence of character-level noise makes additional noise more likely. Thus, in case of insertion or repetition, we repeatedly add further noise characters with a probability of 1 − p_4. See algorithm 2 for more details.

Our PyTorch implementation of the approach described in this section will be published at https://github.com/JonathanSchaber/shared_task.
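The following Python sketch is a simplified rendering of the token- and character-level noise injection described above (see Algorithms 1 and 2 in the appendix), not the authors' implementation: the noise-token list and the insertion character set are illustrative placeholders, while the default probabilities correspond to the values of p_1–p_4 reported in Table 3.

```python
import random
import string

# Placeholder stand-ins for the resources described in the paper: NOISE_TOKENS for the
# handcrafted list L of English/Standard German words and Swiss location names,
# NOISE_CHARS for the Latin-1-based character set C used for insertions.
NOISE_TOKENS = ["the", "and", "aber", "nicht", "Zürich", "St. Moritz"]   # illustrative
NOISE_CHARS = string.ascii_letters + string.digits + string.punctuation  # illustrative

def add_token_noise(text: str, p1: float = 0.99, p2: float = 0.6) -> str:
    """Token-level noise: insert a noise token with prob. 1 - p1, keep adding further
    tokens with prob. 1 - p2, capped at k/2 insertions overall."""
    tokens = text.split()
    k, inserted, out = len(tokens), 0, []
    for tok in tokens:
        if random.random() > p1:
            out.append(random.choice(NOISE_TOKENS))
            inserted += 1
            while random.random() > p2 and inserted < k / 2:
                out.append(random.choice(NOISE_TOKENS))
                inserted += 1
        out.append(tok)
    return " ".join(out)

def add_char_noise(text: str, p3: float = 0.97, p4: float = 0.5) -> str:
    """Character-level noise: with prob. 1 - p3 omit, insert or repeat a character;
    insertions and repetitions are repeated with prob. 1 - p4."""
    out = []
    for ch in text:
        if random.random() > p3:
            action = random.choice(["omission", "insertion", "repetition"])
            if action == "omission":
                continue  # drop the character
            noise = random.choice(NOISE_CHARS) if action == "insertion" else ch
            out.append(noise)
            while random.random() > p4:
                out.append(noise)
        out.append(ch)
    return "".join(out)

print(add_char_noise(add_token_noise("Viele Personen sind nicht der Überzeugung.")))
```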
Table 2: Results on the development and test set. Abbreviations: Emb = embedding dimension, Clip = clipped after m characters, G = granularity (b = binary, t = ternary, f = fine-grained), N = noise injected (t = true, f = false), TT = training time in hours. The last row shows the configuration that we submitted to the shared task.

Emb | Clip | G | N | TT | Dev Prec | Dev Rec | Dev Acc | Dev F1 | Dev AccT | Dev AccF | Test Prec | Test Rec | Test Acc | Test F1
100 | 280 | b | f | 15.7 | 0.946 | 0.926 | 0.976 | 0.936 | - | - | 0.898 | 0.920 | 0.911 | 0.909
100 | 280 | t | f | 15.9 | 0.971 | 0.957 | 0.986 | 0.964 | 0.980 | - | 0.905 | 0.942 | 0.924 | 0.923
100 | 280 | f | f | 15.8 | 0.994 | 0.992 | 0.997 | 0.993 | 0.989 | 0.994 | 0.907 | 0.990 | 0.946 | 0.947
100 | 100 | b | f | 6.8 | 0.987 | 0.980 | 0.994 | 0.983 | - | - | 0.872 | 0.880 | 0.880 | 0.876
100 | 100 | t | f | 6.6 | 0.982 | 0.973 | 0.992 | 0.978 | 0.988 | - | 0.932 | 0.954 | 0.944 | 0.943
100 | 100 | f | f | 6.5 | 0.991 | 0.989 | 0.996 | 0.990 | 0.988 | 0.991 | 0.930 | 0.984 | 0.957 | 0.956
300 | 100 | b | f | 7.0 | 0.987 | 0.977 | 0.993 | 0.981 | - | - | 0.949 | 0.931 | 0.943 | 0.940
300 | 100 | t | f | 7.1 | 0.992 | 0.988 | 0.996 | 0.990 | 0.995 | - | 0.959 | 0.948 | 0.955 | 0.953
300 | 100 | f | f | 7.1 | 0.992 | 0.989 | 0.996 | 0.990 | 0.988 | 0.992 | 0.927 | 0.985 | 0.956 | 0.955
300 | 100 | b | t | 8.4 | 0.993 | 0.987 | 0.996 | 0.991 | - | - | 0.955 | 0.980 | 0.968 | 0.967
300 | 100 | t | t | 7.8 | 0.993 | 0.986 | 0.996 | 0.990 | 0.995 | - | 0.947 | 0.983 | 0.965 | 0.965
300 | 100 | f | t | 7.8 | 0.994 | 0.987 | 0.997 | 0.991 | 0.988 | 0.992 | 0.945 | 0.993 | 0.969 | 0.968

===4 Results and Discussion===
Our submitted model achieves an F1 score of 96.8% in the official evaluation on the test set, resulting in second place, 1.4% behind the best model.

Table 2 gives an overview of different hyperparameter settings with the corresponding results on the development and test set.[5] We report the following observations: (1) More fine-grained classes generally lead to better results. (2) There is a strong performance drop from development to test set, supporting our noisiness-gap assumption. (3) Injecting noise alleviates this drop and, compared to the same configurations without noise, leads to relative performance increases ranging from 1.2% to 2.7% F1 score on the test set. (4) Increasing the embedding dimensionality leads to more stable results over different granularities. (5) Clipping after 100 characters halves the training time while on average upholding performance.

[5] The test set evaluation relies on gold labels that were made available after the submission deadline.

Since the test set does not contain languages from A, the character-based filter is rarely triggered and its impact on performance is negligible. However, the filter might be important when detecting GSW text in settings where languages in A occur more frequently. More information about hyperparameters and hardware is given in Appendix C.

===5 Conclusion===
This paper described our submission to the GSWID 2020 shared task. We introduced a BiGRU-based architecture, a character-based filter and a noisifier module. Our evaluation results show that more fine-grained classes and adding noise to the training data lead to performance increases. Further investigations will concern pretraining, transformer-based architectures, and a more sophisticated noisifier.

===Acknowledgments===
Above all, we would like to thank Simon Clematide, who supervised this project and suggested more fine-grained classes. Further, we thank the shared task organizers, especially Pius von Däniken, who clarified our questions, and also our proofreaders and reviewers.
===References===
Joseph Chee Chang and Chu-Cheng Lin. 2014. Recurrent-neural-network for language detection on Twitter code-switching corpus. arXiv preprint arXiv:1412.4314.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Kilian Foth, Arne Köhn, Niels Beuck, and Wolfgang Menzel. 2014. Because size does matter: The Hamburg Dependency Treebank.

Pablo Gamallo, Marcos Garcia, Susana Sotelo, and José Ramom Pichel Campos. 2014. Comparing ranking-based and naive Bayes approaches to language detection on tweets. In TweetLID@SEPLN, pages 12–16.

Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. 2012. Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages. In LREC, volume 29, pages 31–43.

Ralf Grubenmann, Don Tuggener, Pius von Däniken, Jan Deriu, and Mark Cieliebak. 2018. SB-CH: A Swiss German corpus with sentiment annotations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Farheen Hanif, Fouzia Latif, and M. Sikandar Hayat Khiyal. 2007. Unicode aided language identification across multiple scripts and heterogeneous data. Information Technology Journal, 6(4):534–540.

Nora Hollenstein and Noëmi Aepli. 2014. Compilation of a Swiss German dialect corpus and its application to PoS tagging. In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, pages 85–94.

Aaron Jaech, George Mulcaire, Shobhit Hathi, Mari Ostendorf, and Noah A. Smith. 2016a. Hierarchical character-word models for language identification. arXiv preprint arXiv:1608.03030.

Aaron Jaech, George Mulcaire, Mari Ostendorf, and Noah A. Smith. 2016b. A neural model for language identification in code-switched tweets. In Proceedings of the Second Workshop on Computational Approaches to Code Switching, pages 60–64.

Tommi Sakari Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, and Krister Lindén. 2019. Automatic language identification in texts: A survey. Journal of Artificial Intelligence Research, 65:675–782.

David Jurgens, Yulia Tsvetkov, and Dan Jurafsky. 2017. Incorporating dialectal variability for socially equitable language identification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 51–57.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Tom Kocmi and Ondřej Bojar. 2017. LanideNN: Multilingual language identification on character window. arXiv preprint arXiv:1701.03338.

Rahul Venkatesh Kumar, M. Anand Kumar, and K. P. Soman. 2015. AmritaCEN NLP@FIRE 2015: Language identification for Indian languages in social media text. In FIRE Workshops, pages 26–28.

Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018. What's in a domain? Learning domain-robust text representations using adversarial training. arXiv preprint arXiv:1805.06088.

Lucy Linder, Michael Jungo, Jean Hennebert, Claudiu Musat, and Andreas Fischer. 2019. Automatic creation of text corpora for low-resource languages from the internet: The case of Swiss German. arXiv preprint arXiv:1912.00159.

Paul McNamee. 2005. Language identification: A solved problem suitable for undergraduate instruction. Journal of Computing Sciences in Colleges, 20(3):94–101.

Jordi Porta. 2014. Twitter language identification using rational kernels and its potential application to sociolinguistics. In TweetLID@SEPLN, pages 17–20.

Younes Samih, Suraj Maharjan, Mohammed Attia, Laura Kallmeyer, and Thamar Solorio. 2016. Multilingual code-switching identification via LSTM recurrent neural networks. In Proceedings of the Second Workshop on Computational Approaches to Code Switching, pages 50–59.

Arkaitz Zubiaga, Inaki San Vicente, Pablo Gamallo, José Ramom Pichel Campos, Iñaki Alegría Loinaz, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno-Fernández. 2014. Overview of TweetLID: Tweet language identification at SEPLN 2014. In TweetLID@SEPLN, pages 1–11.
===A Neural Architecture===
We formally define our architecture as follows. Let E(x_i) denote a function that returns the embedding for a given character x_i ∈ x. The last hidden states →h_n and ←h_0 are given by

→h_n, ←h_0, →o_0, ..., →o_n, ←o_n, ..., ←o_0 = BiGRU(E(x_0), ..., E(x_n)).   (1)

We feed each last hidden state into a block of dense layers defined as

f_block(v) = W_2^T · dr(ReLU(W_1^T · dr(v) + b_1)) + b_2   (2)

where W_1 ∈ R^(300×150) and W_2 ∈ R^(150×50) denote the weight matrices of the block's first and second layer, dr denotes a dropout function, b_1 and b_2 denote learnable biases, and ReLU denotes the rectified linear unit activation function.

z_1 is computed as z_1 = f_block(←h_0) and z_2 as z_2 = f_block(→h_n). Note that the weights of the two dense blocks are not shared, but initialized and trained independently. We concatenate z_1 and z_2 to z, which is then fed through a final layer formalized as

f_final(v) = log-softmax(W^T · v + b)   (3)

with W ∈ R^(100×Q), where Q is the number of target classes.
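A minimal PyTorch sketch of equations (1)–(3) is given below, using the layer sizes from Table 3 (hidden size 300, dense blocks 300 → 150 → 50, final layer input 100) and the embedding dimension of the submitted configuration (300). It is an illustration rather than the authors' published code; the character vocabulary size is a placeholder.

```python
import torch
import torch.nn as nn

class GSWClassifier(nn.Module):
    """Character-level BiGRU classifier following equations (1)-(3):
    embeddings -> BiGRU -> two independent dense blocks -> concat -> log-softmax."""

    def __init__(self, vocab_size: int = 500,  # placeholder character vocabulary size
                 emb_dim: int = 300, hidden: int = 300, num_classes: int = 23,
                 dropout: float = 0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)

        def dense_block():
            # Eq. (2): dropout before both layers, ReLU in between (300 -> 150 -> 50).
            return nn.Sequential(
                nn.Dropout(dropout), nn.Linear(hidden, 150),
                nn.ReLU(),
                nn.Dropout(dropout), nn.Linear(150, 50),
            )

        self.block_fwd = dense_block()  # applied to the forward last hidden state
        self.block_bwd = dense_block()  # applied to the backward last hidden state
        self.final = nn.Linear(100, num_classes)  # Eq. (3), before log-softmax

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(char_ids)       # (batch, seq_len, emb_dim)
        _, h_last = self.bigru(embedded)          # h_last: (2, batch, hidden)
        z1 = self.block_fwd(h_last[0])            # forward last hidden state
        z2 = self.block_bwd(h_last[1])            # backward last hidden state
        logits = self.final(torch.cat([z1, z2], dim=-1))
        return torch.log_softmax(logits, dim=-1)  # log-probabilities over classes

# Example: a batch of two snippets clipped/padded to 100 characters.
model = GSWClassifier()
dummy = torch.randint(0, 500, (2, 100))
log_probs = model(dummy)                               # shape: (2, 23)
loss = nn.NLLLoss()(log_probs, torch.tensor([0, 1]))   # NLL loss as in the paper
```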
===B Noisifier===
As examples of data containing token- and character-level noise, consider the following two made-up text snippets.[6] Character sequences we regard as noise are boldfaced.

Dä bus isch stablibe, mis ticket nüme gültig '''trying to stay chill'''

'''ooohhhh neiiiii''' mir händs nöd gschafft

[6] For copyright reasons we do not cite or display real tweets in the publication.

In the following algorithms, r() denotes a function which returns a random value ∈ [0, 1]. In algorithm 2, the parameter A contains {'omission', 'insertion', 'repetition'}.

Algorithm 1: Token-level noise injection
  Input: x, p_1, p_2, L
  Output: x_Tnoise
  t ← split x into tokens
  k ← |t|; c ← 0
  initialize empty array u
  for each t_j ∈ t do
      if r_1 ← r() > p_1 then
          add randomly chosen token l ∈ L to u; c ← c + 1
          while r_2 ← r() > p_2 ∧ c < k/2 do
              add randomly chosen token l ∈ L to u; c ← c + 1
          end while
      end if
      append t_j to u
  end for
  return concatenation of u as string x_Tnoise

Algorithm 2: Character-level noise injection
  Input: x_Tnoise, p_3, p_4, C, A
  Output: x_Cnoise
  initialize empty string x_Cnoise
  for each x_i ∈ x_Tnoise do
      if r_3 ← r() > p_3 then
          a ← choose random action ∈ A
          if a = 'omission' then
              continue
          else if a = 'insertion' then
              b ← choose random character ∈ C
          else if a = 'repetition' then
              b ← x_i
          end if
          add b to x_Cnoise
          while r_4 ← r() > p_4 do
              add b to x_Cnoise
          end while
      end if
      add x_i to x_Cnoise
  end for
  return x_Cnoise

For a given example input, our noisifier with the parameter settings shown in the hyperparameter table in Appendix C introduces noise structures into non-noisy texts like the following:

clean: Viele Personen sind nicht der Überzeugung.
noisy: Viele Personen sind nicht der Üerzeugunnng.

clean: Hast du schon die neue xbox 3 gesehen?
noisy: Hast du music schon die neue xbox 3 geesehen?

clean: You'll never guess what happened this morning.
noisy: You'll never guess Jwhat happened this morninnng.

clean: Le tigre est un grand chat de proie originaire d'Asie.
noisy: Le tigre estt un grand chatde proie originaire d'Asie.

clean: C'è ancora una mancanza di chiarezza, non possiamo farci nulla.
noisy: C'è ancor una mancanza di chiarezza, non possiamo St. Moritz Frisör farci nulla.

As is obvious from these examples, the noise injected by the noisifier still looks quite different from human-created noise; a more sophisticated noisifier is thus desirable.

===C Configurations===
Table 2 lists only the parameters that were varied during ablation testing. Table 3 below reports the parameters that were kept unchanged.

Table 3: Hyperparameters kept constant during all experiments.

Parameter | Value
hidden size h | 300
dense-block layer-in size | 300
dense-block layer-between size | 150
dense-block layer-out size | 50
final-block layer-in size | 100
final-block layer-out size | # target classes
dropout | 0.1
learning rate | 0.001
number of epochs | 15
p_1 | 0.99
p_2 | 0.6
p_3 | 0.97
p_4 | 0.5
character-filter threshold | 0.8

After two epochs, the learning rate is decreased from 0.001 to 0.0001, and after six epochs it is further decreased to 0.00003. We ran our models on an NVIDIA GeForce GTX TITAN X graphics processing unit.
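As a small illustration of this schedule, the sketch below applies the reported learning rates per epoch. The model and optimizer setup are placeholder assumptions; only the rates and epoch boundaries come from the text above.

```python
import torch

# Assumed setup; only the learning rates and epoch boundaries
# (0.001 -> 0.0001 after 2 epochs -> 0.00003 after 6 epochs) come from the paper.
model = torch.nn.Linear(10, 2)  # stand-in for the classifier
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

def lr_for_epoch(epoch: int) -> float:
    """Return the learning rate for a 0-indexed epoch."""
    if epoch < 2:
        return 0.001
    if epoch < 6:
        return 0.0001
    return 0.00003

for epoch in range(15):
    for group in optimizer.param_groups:
        group["lr"] = lr_for_epoch(epoch)
    # ... one epoch of training would run here ...
```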