      Spoken dialect identification in Twitter using a multi-filter architecture


      Mohammadreza Banaei                               Rémi Lebret                             Karl Aberer
                                                      EPFL, Switzerland




                          Abstract

     This paper presents our approach for SwissText & KONVENS 2020 shared task 2: a multi-stage neural model for Swiss German (GSW) identification on Twitter. Our model outputs either GSW or non-GSW and is not meant to be used as a generic language identifier. Our architecture consists of two independent filters, where the first one favors recall and the second one favors precision (both towards GSW). Moreover, we do not use binary models (GSW vs. not-GSW) in our filters but rather a multi-class classifier with GSW being one of the possible labels. Our model reaches an F1-score of 0.982 on the test set of the shared task.

1 Introduction

Out of the more than 8000 languages in the world (Hammarström et al., 2020), the Twitter language identifier (LID) only supports around 30 of the most used languages1, which is not enough for the needs of the NLP community. Furthermore, it has been shown that even for these frequently used languages, the Twitter LID is not highly accurate, especially when the tweet is relatively short (Zubiaga et al., 2016).
   However, Twitter data is linguistically diverse and in particular includes tweets in many low-resource languages and dialects. A better-performing Twitter LID can help us gather large amounts of (unlabeled) text in these low-resource languages, which can be used to enrich models in many downstream NLP tasks, such as sentiment analysis (Volkova et al., 2013) and named entity recognition (Ritter et al., 2011).
   However, the generalization of state-of-the-art NLP models to low-resource languages is generally hard due to the lack of corpora with good coverage in these languages. The extreme case is spoken dialects, where there might be no standard spelling at all. In this paper, we focus on Swiss German as our low-resource dialect. As Swiss German is a spoken dialect, people might spell a certain word differently, and even a single author might use different spellings for a word in two different sentences. There also exists a dialect continuum across the German-speaking part of Switzerland, which makes NLP for Swiss German even more challenging. Swiss German has its own pronunciation and grammar, and many of its words differ from German.
   There exist previous efforts to discriminate similar languages with the help of tweet metadata such as geo-location (Williams and Dagli, 2017), but in this paper we do not use tweet metadata and restrict our model to the tweet content only. Therefore, this model can also be used for language identification in sources other than Twitter.
   LIDs that support GSW, such as the fastText (Joulin et al., 2016) LID model, are often trained on the Alemannic Wikipedia, which also contains other German dialects such as Swabian, Walser German, and Alsatian German; hence, these models are not able to discriminate dialects that are close to GSW. Moreover, the fastText LID also has a rather low recall (0.362) for Swiss German tweets, as it identified many of them as German.
   In this paper, we use two independently trained filters to remove non-GSW tweets. The first filter is a classifier that favors recall (towards GSW), and the second one favors precision. The same idea can be extended to N consecutive filters (with N ≥ 2), with the first N − 1 favoring recall and the last filter favoring precision. In this way, we make sure that GSW samples are not filtered out (with high probability) in the first N − 1 stages, and the precision of the whole pipeline can be improved by having a filter that favors precision at the end (the N-th filter). The reason we use only two filters is that adding more filters improved the performance (measured by GSW F1-score) only negligibly on our validation set.
   We demonstrate that with this architecture we can achieve an F1-score of 0.982 on the test set, even with a small amount of available data in the target domain (Twitter data). Section 2 presents the architecture of each of our filters and the rationale behind the chosen training data for each of them. In section 3, we discuss our LID implementation details and give a detailed description of the datasets used. Section 4 presents the performance of our filters on the held-out test dataset. Moreover, we show the contribution of each filter to removing non-GSW tweets to see their individual importance in the whole pipeline (for this specific test dataset).

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
   1 https://dev.twitter.com/docs/developer-utilities/supported-languages/api-reference

2 Multi-filter language identification

In this paper, we use a combination of N − 1 filters favoring recall, followed by a final filter that favors precision. We choose N = 2 in this paper to demonstrate the effectiveness of the approach. As discussed before, adding more filters improved the performance of the pipeline only negligibly for this specific dataset. However, for more challenging datasets, N > 2 might be needed to improve the LID precision.
   Both of our filters are multi-class classifiers with GSW being one of the possible labels. We found it empirically better to use roughly balanced classes for training the multi-class classifier, rather than turning the same training data into a highly imbalanced GSW vs. non-GSW dataset for a binary classifier, especially for the first filter (section 2.1), which has many more parameters than the second filter (section 2.2).
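   To make the cascade concrete, the following minimal sketch shows how the two filters can be combined at inference time. It is an illustration rather than the exact implementation: bert_filter and fasttext_filter stand for generic callables returning a predicted label and its probability for a preprocessed tweet, and the 0.64 threshold is the value reported later in section 3.3.2.

    def is_gsw(tweet, bert_filter, fasttext_filter, threshold=0.64):
        """Keep a tweet only if every filter in the cascade predicts GSW."""
        # First filter: favors recall, so it removes most non-GSW tweets
        # while (with high probability) keeping all GSW tweets.
        label, _ = bert_filter(tweet)
        if label != "gsw":
            return False
        # Second filter: favors precision via a tuned probability threshold,
        # removing the remaining non-GSW false positives.
        label, prob = fasttext_filter(tweet)
        return label == "gsw" and prob > threshold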
                                                               formal (in structure and also used phrases) than other
2.1 First filter: fine-tuned BERT model                        (non-GSW) classes samples (mostly from Wikipedia
The first filter should be designed in a way to favor          and NewsCrawl). Moreover, as our target dataset
GSW recall, either by tuning inference thresholds or           consist of tweets (mostly informal sentences), this
by using training data that implicitly enforces this bias      could make this filter having high GSW recall during
towards GSW. Here we follow the second approach                the inference phase. Additionally, our main reason for
for this filter by using different domains for training        using a cased tokenizer for this filter is to let the model
different labels, which is further discussed below.            also use irregularities in writing, such as improper
Moreover, we use a more complex (in terms of the               capitalization. As these irregularities mostly occur in
number of parameters) model for the first filter, so           informal writing, it will again bias the model towards
that it does the main job of removing non-GSW inputs           GSW (improving GSW recall) when tweets are passed
while having reasonable GSW precision (further detail          to it, as most of the GSW training samples are informal.
in section 4). The second filter will be later used to
improve the pipeline precision by removing a relatively        2.2 Second filter: fastText classifier
smaller number of non-GSW tweets.                              For this filter, we also train a multiclass classifier with
   Our first filter is a fine-tuned BERT (Devlin et al.,       GSW being one of the labels. The other classes are
2018) model for the LID downstream task. As we                 again close languages (in structure) to GSW such
do not have a large amount of unsupervised GSW                 as German, Dutch and Spanish (further detail in
data, it will be hard to train the BERT language model         section 3.1). Additionally, as mentioned before, our
(LM) from scratch on GSW itself. Hence, we use the             second filter should have a reasonably high precision
German pre-trained LM (BERT-base-cased model2),                to enhance the full pipeline precision. Hence, unlike
which is the closest high-resource language to GSW.            the first filter, we choose the whole training data
  2
    Training details available at https://huggingface.         to be sampled from a similar domain to the target
co/bert-base-german-cased (Wolf et al., 2019)                  test set. non-GSW samples are tweets from SEPLN
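   To make this design choice concrete, the snippet below contrasts the frozen-body alternative with our setting using the Hugging Face transformers API; it is an illustrative sketch rather than the exact training code.

    from transformers import AutoModelForSequenceClassification

    # German pre-trained BERT with an 8-way classification head (GSW is one label).
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-german-cased", num_labels=8)

    # Frozen-body alternative (NOT used here): train only the classifier layer.
    # for param in model.base_model.parameters():
    #     param.requires_grad = False

    # Our setting: every parameter stays trainable, so the German subword
    # representations can adapt to GSW during fine-tuning.
    assert all(p.requires_grad for p in model.parameters())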

2.2 Second filter: fastText classifier

For this filter, we also train a multi-class classifier with GSW being one of the labels. The other classes are again languages close (in structure) to GSW, such as German, Dutch and Spanish (further details in section 3.1). Additionally, as mentioned before, our second filter should have a reasonably high precision to enhance the precision of the full pipeline. Hence, unlike for the first filter, we choose the whole training data to be sampled from a domain similar to the target test set. The non-GSW samples are tweets from SEPLN 2014 (Zubiaga et al., 2014) and from the Carter et al. (2013) dataset. The GSW samples consist of the GSW tweets provided by this shared task and part of the GSW samples of the Swiss SMS corpus (Stark et al., 2015).
   As the described training data is rather small compared to the first filter's training data, we should also train a simpler architecture with significantly fewer parameters. We take advantage of fastText (Joulin et al., 2016) for training this model, which in our case is based on a bag of character n-grams. Moreover, unlike the first filter, this model is not a cased model, and we lower-case the input sentences to reduce the vocabulary size. The hyper-parameters used for this model can be found in section 3.

3 Experimental Setup

In this section, we describe the datasets and the hyper-parameters for both filters in the pipeline. We also describe our preprocessing method, which is specifically designed to handle inputs from social media.

3.1 Datasets

For both filters, we use 80% of the data for training, 5% for the validation set and 15% for the test set.

3.1.1 First filter
The sentences are from the Leipzig corpora (Goldhahn et al., 2012) and the SwissCrawl (Linder et al., 2019) dataset. The classes and the number of samples in each class are shown in Table 1. We pick the classes proposed by Linder et al. (2019) for training the GSW LID. The main differences of our first filter from their LID are the GSW sentences and the fact that our fine-tuning dataset is about three times larger than theirs. Each of the “other”3 and “GSW-like”4 classes is a group of languages whose members cannot be represented as separate classes due to their small number of samples. The GSW-like class is included to make sure that the model can distinguish other German dialects from GSW (hence reducing GSW false positives).
   3 Catalan, Croatian, Danish, Esperanto, Estonian, Finnish, French, Irish, Galician, Icelandic, Italian, Javanese, Konkani, Papiamento, Portuguese, Romanian, Slovenian, Spanish, Swahili, Swedish
   4 Bavarian, Kolsch, Limburgan, Low German, Northern Frisian, Palatine German

     Language             Number of samples
     Afrikaans                    250000
     German                       100000
     English                      250000
     Swiss-German                 250000
     GSW-like                     250000
     Luxembourgian                250000
     Dutch                        250000
     Other                        250000

Table 1: Distribution of samples in the first filter dataset

3.1.2 Second filter
The sentences are mostly from Twitter (except for some GSW samples from the Swiss SMS corpus (Stark et al., 2015)). In Table 2, we can see the distribution of the different classes. The GSW samples consist of 1971 tweets (provided by the shared task organizers) and 3000 GSW samples from the Swiss SMS corpus.

     Language            Number of samples
     Catalan                     2000
     Dutch                        560
     English                     2533
     French                       639
     German                      3608
     Spanish                     2707
     Swiss German                4971

Table 2: Distribution of samples in the second filter dataset

3.2 Preprocessing
As the dataset sentences are mostly from social media, we use a custom tokenizer that removes common social media tokens (emoticons, emojis, URLs, hashtags, Twitter mentions) that are not useful for LID. We also normalize word elongation, as it might be misleading for LID. For the second filter, we additionally lower-case the input sentences before passing them to the model.
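   A possible implementation of this preprocessing step is sketched below; the exact regular expressions (and the choice to cap elongations at two repeated characters) are illustrative assumptions rather than the exact patterns of our tokenizer.

    import re

    URL = re.compile(r"https?://\S+|www\.\S+")
    MENTION = re.compile(r"@\w+")
    HASHTAG = re.compile(r"#\w+")
    # Rough emoji ranges; emoticons such as ":-)" would need extra patterns.
    EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
    ELONGATION = re.compile(r"(.)\1{2,}")  # e.g. "soooo" -> "soo"

    def clean_tweet(tweet, lower=False):
        for pattern in (URL, MENTION, HASHTAG, EMOJI):
            tweet = pattern.sub(" ", tweet)
        tweet = ELONGATION.sub(r"\1\1", tweet)      # normalize word elongation
        tweet = re.sub(r"\s+", " ", tweet).strip()  # collapse leftover whitespace
        return tweet.lower() if lower else tweet    # lower-casing only for the second filter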

3.3 Implementation details

3.3.1 BERT filter
We train this filter by fine-tuning a German pre-trained BERT-cased model on our LID task. As mentioned before, we do not freeze the BERT body in the fine-tuning phase. We train for two epochs, with a batch size of 64 and a maximum sequence length of 64. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 2e-5.
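   For illustration, a minimal fine-tuning loop with these hyper-parameters could look as follows; it assumes that train_texts and train_labels (integer ids for the eight classes) have already been built from the data described in section 3.1, and it is a sketch rather than the exact training script.

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-german-cased", num_labels=8)

    enc = tokenizer(train_texts, padding="max_length", truncation=True,
                    max_length=64, return_tensors="pt")
    loader = DataLoader(
        TensorDataset(enc["input_ids"], enc["attention_mask"],
                      torch.tensor(train_labels)),
        batch_size=64, shuffle=True)

    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # whole body is trainable
    model.train()
    for epoch in range(2):
        for input_ids, attention_mask, labels in loader:
            optimizer.zero_grad()
            out = model(input_ids=input_ids, attention_mask=attention_mask,
                        labels=labels)
            out.loss.backward()
            optimizer.step()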

3.3.2 fastText filter
We train this filter using the fastText (Joulin et al., 2016) classifier for 30 epochs, using character n-grams as features (with 2 ≤ n ≤ 5) and an embedding dimension of 50. To favor precision during inference, we label a tweet as GSW only if the model probability for GSW is greater than 64% (this threshold is treated as a hyper-parameter and was optimized on the validation set).
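   The corresponding fastText training and thresholded inference can be sketched as follows; the training-file name is illustrative, and the file is assumed to contain one lower-cased, preprocessed sentence per line, prefixed with its label (e.g. __label__gsw).

    import fasttext

    model = fasttext.train_supervised(
        input="second_filter_train.txt",
        epoch=30,        # 30 training epochs
        minn=2, maxn=5,  # character n-grams with 2 <= n <= 5
        dim=50)          # embedding dimension

    def second_filter(tweet, threshold=0.64):
        """Keep a tweet only if GSW is predicted with probability above the threshold."""
        labels, probs = model.predict(tweet)
        return labels[0] == "__label__gsw" and probs[0] > threshold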

4 Results

In this section, we evaluate the performance of our two filters (either in isolation or within the full pipeline) on the held-out test dataset of the shared task. We also evaluate the BERT filter on its own test data (Leipzig and SwissCrawl samples).

4.1 BERT filter performance on Leipzig + SwissCrawl corpora
We first evaluate our BERT filter on the test set of the first filter (Leipzig corpora + SwissCrawl). Table 3 shows the filter's performance on the different labels. The filter has an F1-score of 99.8% on the GSW test set. However, when this model is applied to Twitter data, we expect a decrease in performance due to the short and also informal messages.

   Language            Precision   Recall   F1-score
   Afrikaans            0.9982     0.9981    0.9982
   German               0.9976     0.9949    0.9962
   English              0.9994     0.9992    0.9993
   Swiss-German         0.9974     0.9994    0.9984
   GSW-like             0.9968     0.9950    0.9959
   Luxembourgian        0.9994     0.9989    0.9992
   Dutch                0.9956     0.9965    0.9960
   Other                0.9983     0.9989    0.9986

Table 3: First filter performance on the Leipzig + SwissCrawl corpora

4.2 Performance on the shared-task test set
In Table 4, we can see the performance of both filters, either in isolation or when they are used together. As shown in this table, the improvement from adding the second filter is rather small. The main reason can be seen in Table 5: for this shared-task test set (whose label distribution is shown in Table 6), the majority of the non-GSW filtering is already done by the first filter.

   Model                Precision   Recall   F1-score
   BERT filter           0.9742     0.9896    0.9817
   fastText filter       0.9076     0.9892    0.9466
   BERT + fastText       0.9811     0.9834    0.9823
   fastText baseline     0.9915     0.3619    0.5303

Table 4: Filter performance on the shared-task test set compared to the fastText (Joulin et al., 2016) LID baseline

   Model                Number of filtered samples
   BERT filter                     2741
   fastText filter                   35

Table 5: Number of non-GSW removals by each filter

   Label        Number of samples
   not-GSW               2782
   GSW                   2592

Table 6: Distribution of labels in the test set

4.3 Discussion
Our designed LID outperforms the baseline significantly (Table 4), which underlines the importance of having a domain-specific LID. Additionally, although the positive effect of the second filter is quite small on this test set, when we applied the same architecture to randomly sampled tweets (German tweets according to the Twitter API), we observed that the second filter could reduce the number of GSW false positives significantly. Hence, the number of filters used is indeed highly dependent on the complexity of the target dataset.

5 Conclusion

In this work, we propose an architecture for spoken dialect (Swiss German) identification: a multi-filter architecture that is able to effectively filter out non-GSW tweets during the inference phase. We evaluated our model on the GSW LID shared task test set and reached an F1-score of 0.982.
   However, there are other useful features that could be used during training, such as the orthographic conventions in GSW writing observed by Honnet et al. (2017), whose presence might not be easily captured even by a complex model like BERT. Moreover, in this paper we did not use tweet metadata as a feature and only focused on tweet content, although such metadata can improve LID for dialects considerably (Williams and Dagli, 2017). These two directions, among others, are future work that needs further study to assess its usefulness for low-resource language identification.

References

Simon Carter, Wouter Weerkamp, and Manos Tsagkias.
  2013. Microblog language identification: Overcoming
  the limitations of short, unedited and idiomatic text.
  Language Resources and Evaluation, 47(1):195–215.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
  Toutanova. 2018. Bert: Pre-training of deep bidirectional
  transformers for language understanding. arXiv preprint
  arXiv:1810.04805.
Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff.
  2012. Building large monolingual dictionaries at the
  leipzig corpora collection: From 100 to 200 languages.
  In LREC, volume 29, pages 31–43.
Harald Hammarström, Sebastian Bank, Robert Forkel,
  and Martin Haspelmath. 2020. Glottolog 4.2.1. Max
  Planck Institute for the Science of Human History,
  Jena. Available online at http://glottolog.org, Accessed
  on 2020-04-18.
Pierre-Edouard Honnet, Andrei Popescu-Belis, Claudiu
   Musat, and Michael Baeriswyl. 2017. Machine
   translation of low-resource spoken dialects: Strategies
   for normalizing swiss german. arXiv preprint
   arXiv:1710.11035.
Armand Joulin, Edouard Grave, Piotr Bojanowski, and
  Tomas Mikolov. 2016. Bag of tricks for efficient text
  classification. arXiv preprint arXiv:1607.01759.
Diederik P Kingma and Jimmy Ba. 2014. Adam: A
  method for stochastic optimization. arXiv preprint
  arXiv:1412.6980.
Lucy Linder, Michael Jungo, Jean Hennebert, Claudiu
  Musat, and Andreas Fischer. 2019. Automatic creation
  of text corpora for low-resource languages from the
  internet: The case of swiss german. arXiv preprint
  arXiv:1912.00159.
Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011. Named
  entity recognition in tweets: an experimental study. In
  Proceedings of the conference on empirical methods
  in natural language processing, pages 1524–1534.
  Association for Computational Linguistics.
Elisabeth Stark, Simon Ueberwasser, and Beni Ruef.
   2015. Swiss sms corpus. www.sms4science.ch.
Svitlana Volkova, Theresa Wilson, and David Yarowsky.
  2013. Exploring demographic language variations
  to improve multilingual sentiment analysis in social
  media. In Proceedings of the 2013 Conference on
   Empirical Methods in Natural Language Processing,
   pages 1815–1827, Seattle, Washington, USA.
  Association for Computational Linguistics.
Jennifer Williams and Charlie Dagli. 2017. Twitter language
   identification of similar languages and dialects
   without ground truth. In Proceedings of the Fourth
   Workshop on NLP for Similar Languages, Varieties
   and Dialects (VarDial), pages 73–83.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien
  Chaumond, Clement Delangue, Anthony Moi, Pierric
  Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al.
  2019. Transformers: State-of-the-art natural language
  processing. arXiv preprint arXiv:1910.03771.
Arkaitz Zubiaga, Iñaki San Vicente, Pablo Gamallo, José
  Ramom Pichel Campos, Iñaki Alegría Loinaz, Nora
  Aranberri, Aitzol Ezeiza, and Víctor Fresno-Fernández.
  2014. Overview of tweetlid: Tweet language identification
  at sepln 2014. In TweetLID@SEPLN, pages 1–11.
Arkaitz Zubiaga, Iñaki San Vicente, Pablo Gamallo,
  José Ramom Pichel, Iñaki Alegria, Nora Aranberri,
  Aitzol Ezeiza, and Víctor Fresno. 2016. Tweetlid:
  a benchmark for tweet language identification.
  Language Resources and Evaluation, 50(4):729–766.