Idiap Submission to Swiss-German Language Detection Shared Task

Shantipriya Parida (1), Esaú Villatoro-Tello (2,1), Sajit Kumar (3), Petr Motlicek (1), and Qingran Zhan (1)

(1) Idiap Research Institute, Rue Marconi 19, 1920 Martigny, Switzerland. firstname.lastname@idiap.ch
(2) Universidad Autónoma Metropolitana, Unidad Cuajimalpa, Mexico City, Mexico. evillatoro@correo.cua.uam.mx
(3) Centre of Excellence in AI, Indian Institute of Technology, Kharagpur, West Bengal, India. kumar.sajit.sk@gmail.com

Abstract

Language detection is a key part of the NLP pipeline for text processing. The task of automatically detecting languages belonging to disjoint groups is relatively easy, but it is considerably more challenging to detect languages that have similar origins or dialects. This paper describes Idiap's submission to the 2020 GermEval evaluation campaign (https://sites.google.com/view/gswid2020) on Swiss-German language detection. In this work, we feed high-dimensional features generated from the text data to a supervised autoencoder for detecting languages with dialect variances. A Bayesian optimizer was used to fine-tune the hyper-parameters of the supervised autoencoder. To the best of our knowledge, we are the first to apply a supervised autoencoder to the language detection task.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The increased usage of smartphones, social media, and the internet has led to rapid growth in the generation of short linguistic texts. Thus, identification of language is a key component in building various NLP resources (Kocmi and Bojar, 2017). Language detection is the task of determining the language of a given text. Although the field has progressed substantially, a few challenges remain: (1) distinguishing among similar languages, (2) detecting languages when content from multiple languages exists within a single document, and (3) identifying the language of very short texts (Balazevic et al., 2016; Lui et al., 2014; Williams and Dagli, 2017).

It is a difficult task to discriminate between very close languages or dialects (for example, German dialect identification and Indo-Aryan language identification (Jauhiainen et al., 2019a)). Although dialect identification is commonly based on the distributions of letters or letter n-grams, it may not be possible to distinguish related dialects with very similar phoneme and grapheme inventories for some languages (Scherrer and Rambow, 2010).

Many authors have proposed traditional machine learning approaches for language detection, such as Naive Bayes, SVM, word and character n-grams, graph-based n-grams, prediction by partial matching (PPM), and linear interpolation with post-independent weight optimization and majority voting for combining multiple classifiers (Jauhiainen et al., 2019b).

More recently, deep learning techniques have shown substantial performance on many NLP tasks, including language detection (Oro et al., 2018). In this context, many papers have demonstrated the capability of semi-supervised autoencoders to solve different tasks, indicating that autoencoders allow learning a representation when trained with unlabeled data (Ranzato and Szummer, 2008; Rasmus et al., 2015). However, as per our literature survey, none of the recent research has applied autoencoders to the language detection task. In this paper, we propose a supervised configuration of the autoencoder which utilizes labels for learning the representation. To the best of our knowledge, this is the first time this technology is evaluated in the context of the language detection task.

1.1 Supervised Autoencoder

An autoencoder (AE) is a neural network that learns a representation (encoding) of input data and then learns to reconstruct the original input from the learned representation.
The autoencoder is mainly used for dimensionality reduction or feature extraction (Zhu and Zhang, 2019). Normally, it is used in an unsupervised learning fashion, meaning that we leverage the neural network for the task of representation learning. By learning to reconstruct the input, the AE extracts underlying abstract attributes that facilitate accurate prediction of the input.

Thus, a supervised autoencoder (SAE) is an autoencoder with the addition of a supervised loss on the representation layer. For the case of a single hidden layer, the supervised loss is added to the output layer; for a deeper autoencoder, the supervised loss is added to the innermost (smallest) bottleneck layer, which is usually transferred to the supervised layer after training the autoencoder.

In supervised learning, the goal is to learn a function from a vector of inputs $x \in \mathbb{R}^d$ to predict a vector of targets $y \in \mathbb{R}^m$. Consider an SAE with a single hidden layer of size $k$, where the weights for the first layer are $F \in \mathbb{R}^{k \times d}$. The function is trained on a finite batch of independent and identically distributed (i.i.d.) data, $(x_1, y_1), \ldots, (x_t, y_t)$, with the goal of accurate prediction on new samples generated from the same distribution. The output layer consists of weights $W_p \in \mathbb{R}^{m \times k}$ to predict $y$ and $W_r \in \mathbb{R}^{d \times k}$ to reconstruct $x$. Let $L_p$ be the supervised loss and $L_r$ the loss for the reconstruction error. In the case of regression, both losses might be represented by a squared error, resulting in the objective:

$$\frac{1}{t} \sum_{i=1}^{t} \Big[ L_p(W_p F x_i, y_i) + L_r(W_r F x_i, x_i) \Big] = \frac{1}{2t} \sum_{i=1}^{t} \Big[ \| W_p F x_i - y_i \|_2^2 + \| W_r F x_i - x_i \|_2^2 \Big] \qquad (1)$$

The addition of the supervised loss to the autoencoder loss function acts as a regularizer and, as shown in Equation 1, results in learning a better representation for the desired task (Le et al., 2018).
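As an illustration, the following minimal sketch shows how a reconstruction loss $L_r$ and a supervised classification loss $L_p$ are combined on a shared encoder. This is a PyTorch-style sketch; the layer sizes, optimizer settings, and training step are illustrative assumptions, not our exact configuration.

```python
import torch
import torch.nn as nn

class SupervisedAutoencoder(nn.Module):
    """Single-hidden-layer SAE: shared encoder F with a reconstruction
    head W_r and a prediction head W_p, mirroring Equation (1)."""
    def __init__(self, d_in, k_hidden, n_classes):
        super().__init__()
        self.encoder = nn.Linear(d_in, k_hidden)          # F in R^{k x d}
        self.decoder = nn.Linear(k_hidden, d_in)          # W_r in R^{d x k}
        self.classifier = nn.Linear(k_hidden, n_classes)  # W_p in R^{m x k}

    def forward(self, x):
        z = torch.relu(self.encoder(x))                   # shared representation
        return self.decoder(z), self.classifier(z)

model = SupervisedAutoencoder(d_in=300, k_hidden=64, n_classes=2)  # illustrative sizes
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
l_r, l_p = nn.MSELoss(), nn.CrossEntropyLoss()            # L_r and L_p

x = torch.randn(32, 300)               # toy batch of feature vectors
y = torch.randint(0, 2, (32,))         # toy labels: gsw vs. not gsw
x_hat, logits = model(x)
loss = l_p(logits, y) + l_r(x_hat, x)  # supervised loss regularized by reconstruction
loss.backward()
optimizer.step()
```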
1.2 Bayesian Optimizer

In the case of an SAE, there are many hyperparameters related to (a) model construction and (b) optimization. Hence, SAE training without any hyperparameter tuning usually results in poor performance due to dependencies that may result in simultaneous over/under-fitting.

Global optimization is the challenging problem of finding the globally best solution of (possibly nonlinear) models in the (possible or known) presence of multiple local optima. Bayesian optimization (BO) has been shown to outperform other state-of-the-art global optimization algorithms on several challenging optimization benchmark functions (Snoek et al., 2012; Bergstra and Bengio, 2012). BO provides a principled technique, based on Bayes' theorem, to direct the search of a global optimization problem efficiently and effectively. It works by building a probabilistic model of the objective function, called the surrogate function, which is then searched efficiently with an acquisition function before candidate samples are chosen for evaluation on the real objective function. It tries to solve the minimization problem:

$$x^{*} = \arg\min_{x \in \chi} f(x), \qquad (2)$$

where we consider $\chi$ to be a compact subset of $\mathbb{R}^k$ (Snoek et al., 2015).

Thus, we employed BO for hyperparameter optimization, where the objective is to find the hyperparameters of a given machine learning algorithm that yield the best performance as measured on a validation set.
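A minimal sketch of this procedure is given below, using scikit-optimize's gp_minimize as one plausible BO implementation over the search space of Table 3; the library choice is an assumption, and train_and_eval is a hypothetical helper that trains the SAE with the given hyperparameters and returns its validation-set accuracy.

```python
from skopt import gp_minimize
from skopt.space import Categorical, Integer, Real

space = [
    Integer(1, 5, name="n_layers"),                         # number of layers (Table 3)
    Real(1e-5, 1e-2, prior="log-uniform", name="lr"),       # learning rate
    Real(1e-6, 1e-3, prior="log-uniform", name="wd"),       # weight decay
    Categorical(["relu", "sigma"], name="activation"),      # activation functions
]

def objective(params):
    n_layers, lr, wd, activation = params
    val_acc = train_and_eval(n_layers, lr, wd, activation)  # hypothetical helper
    return -val_acc                                         # BO minimizes f(x)

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("best hyperparameters:", result.x)
```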
2 Proposed Method

The architecture of the proposed model is shown in Figure 1. We used character n-grams as features from the input text. In comparison to word n-grams, which only capture the identity of a word and its possible neighbors, character n-grams additionally provide an excellent trade-off between sparseness and word identity, while at the same time combining different types of information: punctuation, the morphological makeup of a word, lexicon, and even context (Wei et al., 2009; Kulmizev et al., 2017; Sánchez-Vega et al., 2019). The extracted n-gram features are input to the deep SAE as shown in Figure 1. The deep SAE contains multiple hidden layers. We used BO for selecting the optimal parameters.

Figure 1: Proposed model architecture. The extracted features of the text are fed to the supervised autoencoder. The targets "y" are included. The classification outputs are the language IDs of the classified languages.
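A minimal sketch of the character n-gram feature extraction follows, using scikit-learn's CountVectorizer with the 1-3 n-gram range of Table 4; the exact vectorizer settings are an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Character 1- to 3-grams, matching the range in Table 4; other
# vectorizer settings here are illustrative assumptions.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3))
texts = ["Grüezi mitenand", "Hello everyone"]   # toy gsw / not-gsw inputs
X = vectorizer.fit_transform(texts)             # sparse count matrix fed to the SAE
print(X.shape, len(vectorizer.vocabulary_))
```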


3 Experimental Setup and Datasets

The training dataset was provided by the organizers of the shared task. It consists of 2,000 tweets in the Swiss-German language (although 2K Twitter IDs were provided, we were not able to retrieve them all, resulting in 1,976 training instances). Participants were allowed to use any additional resources as training datasets. Among the additional resources recommended by the organizers, the following Swiss-German datasets were suggested and used in our experiments: NOAH (https://noe-eva.github.io/NOAH-Corpus/) (Hollenstein and Aepli, 2015) and SwissCrawl (https://icosys.ch/swisscrawl) (Linder et al., 2019).

The test data released by the organizers consists of 5,374 tweets (a mix of different languages) to be classified as Swiss-German versus not Swiss-German.

The training dataset provided by the organizers did not contain any non-Swiss-German text. Therefore, in addition to the recommended Swiss-German datasets, we used other, non-Swiss-German datasets for training our models: DSL (http://ttg.uni-saarland.de/resources/DSLCC/) (Tan et al., 2014a) and Ling10 (https://github.com/johnolafenwa/Ling10).

• DSL Dataset: The data obtained from the "Discriminating between Similar Languages (DSL) Shared Task 2015" contains 13 different languages, as shown in Table 1. The DSL corpus collection has different versions based on different language groups, providing datasets for researchers to test their systems (Tan et al., 2014a). We selected DSLCC version 2.0 (https://github.com/Simdiva/DSL-Task/tree/master/data/DSLCC-v2.0) for our experiments (Tan et al., 2014b).

• Ling10 Dataset: The Ling10 dataset contains 190,000 sentences categorized into 10 languages (English, French, Portuguese, Chinese Mandarin, Russian, Hebrew, Polish, Japanese, Italian, Dutch), mainly used for language detection and for benchmarking NLP algorithms. We considered "Ling10-trainlarge" (one of the three variants of the Ling10 dataset) in our experiments.

Group Name                     Language                Id
South Eastern Slavic           Bulgarian               bg
                               Macedonian              mk
South Western Slavic           Bosnian                 bs
                               Croatian                hr
                               Serbian                 sr
West-Slavic                    Czech                   cz
                               Slovak                  sk
Ibero-Romance (Spanish)        Peninsular Spanish      es-ES
                               Argentinian Spanish     es-AR
Ibero-Romance (Portuguese)     Brazilian Portuguese    pt-BR
                               European Portuguese     pt-PT
Austronesian                   Indonesian              id
                               Malay                   my

Table 1: DSL language groups. Similar languages with their language codes.

As the task is a binary classification of Swiss-German versus not Swiss-German, we split our whole collection of datasets, including the training set provided by the organizers, into two categories as follows:

• Swiss-German (NOAH, SwissCrawl, Swiss-German training tweets).
• not Swiss-German (DSL, Ling10).

Accordingly, we labeled the target class of all the Swiss-German text as "gsw" (Swiss-German) and the target class of all other language text as "not gsw".
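A minimal sketch of this binary relabeling is shown below; the file names and the plain-text loader are hypothetical stand-ins for the actual corpus readers.

```python
def load_sentences(path):
    """Hypothetical loader: one sentence per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

data = []
for path in ("noah.txt", "swisscrawl.txt", "swisstext_train.txt"):  # hypothetical files
    data += [(s, "gsw") for s in load_sentences(path)]
for path in ("dsl.txt", "ling10.txt"):                               # hypothetical files
    data += [(s, "not gsw") for s in load_sentences(path)]
```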
We prepared three settings (S1, S2, and S3) combining the above datasets in different proportions of Swiss-German versus not Swiss-German for training the model. The dataset statistics for these settings are shown in Table 2.

We mixed the Swiss-German and other-language datasets and split them into different ratios for training and development as per the settings. In each setting, the training and development sets differ based on the number of sentences selected from each dataset. We used the test set provided by the shared task organizers. As the test set includes Twitter text, during preprocessing we removed emojis and other unnecessary symbols.
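A minimal sketch of this cleanup step follows; the exact symbol inventory we removed is broader, and the Unicode ranges below are an illustrative assumption.

```python
import re

# Strips common emoji and pictograph blocks; the exact ranges are an assumption.
EMOJI = re.compile(
    "[\u2600-\u27BF\U0001F1E6-\U0001F1FF\U0001F300-\U0001FAFF]"
)

def clean_tweet(text: str) -> str:
    return EMOJI.sub("", text).strip()

print(clean_tweet("Grüezi mitenand 😀🇨🇭"))  # -> "Grüezi mitenand"
```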
The range of values for the hyperparameter search space is shown in Table 3. During training, BO chooses the best hyperparameters from this range. The overall configuration of the SAE model is shown in Table 4.

4 Results and Discussion

We evaluated performance on the development set, while the test set evaluation was performed by the shared task organizers. The development set performance is given in Section 4.1 and the test set performance in Section 4.2.

Our evaluation consists of calculating classification accuracy based on the predicted label compared with the actual label. The organizers calculated precision, average precision, recall, and F1 score for each of the submissions. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations; recall (or sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual positive class; and the F1 score is the weighted average of precision and recall.

The organizers also generated the Receiver Operating Characteristic curve (ROC), the Area Under the ROC Curve (AUC), and Precision-Recall (PR) curves. The AUC-ROC curve is a performance measurement at various threshold settings: ROC is a probability curve, and AUC represents the degree or measure of separability, indicating how well a trained model is capable of distinguishing between classes; thus, the higher the AUC, the better the model performance. Finally, PR curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds; hence, a good model is represented by a curve that bows towards (1,1). A sketch of these metrics is given below.
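This sketch computes the metrics described above with scikit-learn; the prediction vectors are toy stand-ins, not our actual outputs.

```python
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 1]                 # 1 = "gsw", 0 = "not gsw"
y_score = [0.9, 0.2, 0.8, 0.6, 0.4, 0.7]     # predicted probability of "gsw"
y_pred  = [int(s >= 0.5) for s in y_score]   # thresholded labels

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("AUROC:    ", roc_auc_score(y_true, y_score))
print("avg. precision:", average_precision_score(y_true, y_score))
```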
Setting   Datasets and Language              Distribution    Distribution (Overall)    Training   Dev      Test
S1        NOAH (Swiss-German)                7,327 (8%)      50% Swiss-German          80,000     20,000   5,374
          SwissCrawl (Swiss-German)          40,697 (40%)    50% not Swiss-German
          SwissTextTrain (Swiss-German)      1,976 (2%)
          DSL (not Swiss-German)             25,000 (25%)
          Ling10 (not Swiss-German)          25,000 (25%)
S2        NOAH (Swiss-German)                7,327 (5%)      61% Swiss-German          130,000    20,000   5,374
          SwissCrawl (Swiss-German)          81,841 (55%)    39% not Swiss-German
          SwissTextTrain (Swiss-German)      1,976 (1%)
          DSL (not Swiss-German)             25,000 (17%)
          Ling10 (not Swiss-German)          33,856 (22%)
S3        NOAH (Swiss-German)                7,327 (4%)      46% Swiss-German          180,000    20,000   5,374
          SwissCrawl (Swiss-German)          81,841 (41%)    54% not Swiss-German
          SwissTextTrain (Swiss-German)      1,976 (1%)
          DSL (not Swiss-German)             50,000 (25%)
          Ling10 (not Swiss-German)          58,856 (29%)

Table 2: Dataset statistics. The training-development-test set distribution for each setting (S1, S2, and S3). The distribution is based on the number of sentences selected from each dataset.


Hyperparameter          Range
number of layers        1-5
learning rate           $10^{-5}$ - $10^{-2}$
weight decay            $10^{-6}$ - $10^{-3}$
activation functions    'relu', 'sigma'

Table 3: Hyperparameter search space ranges.

Parameter               Value
char n-gram range       1-3
number of targets       2
embedding dimension     300
supervision             'clf' (classification)
convergence threshold   0.00001
number of epochs        500

Table 4: SAE model configuration used for training.
4.1 Development Set

The SAE model performance for the three settings (S1, S2, and S3) on the development set is shown in Table 5. The confusion matrices for all the settings on the development set are shown in Figure 2; they show the correct and incorrect predictions with count values broken down by each class, i.e. "gsw" (Swiss-German) or "not gsw" (not Swiss-German).

Model               Setting    Development Set Accuracy (%)
SAE (char-3gram)    S1         100
SAE (char-3gram)    S2         100
SAE (char-3gram)    S3         100

Table 5: Swiss-German language detection performance (classification accuracy) of the proposed model on the development set for settings S1, S2, and S3.

Figure 2: Confusion matrices on the development (dev) set for settings S1, S2, and S3 (one panel per setting). The confusion matrices show the correct and incorrect predictions with count values broken down by each class, i.e. "gsw" (Swiss-German) or "not gsw" (not Swiss-German).

4.2 Test Set

The overall results announced by the organizers on the test set are shown in Table 6 and in Figure 3. Our submission, labeled "IDIAP", obtained 0.775, 0.998, and 0.872 for precision, recall, and F1 score, respectively, for setting S3, as shown in Table 6. The detailed performance of each of our settings is shown in Table 7.

Team                    Precision    Recall    F1
IDIAP                   0.775        0.998     0.872
jj-cl-uzh               0.945        0.993     0.968
Mohammadreza Banaei     0.984        0.979     0.982

Table 6: Shared task results announced by the organizers, displaying each participating team and its model performance (precision, recall, and F1).

Setting    Prec (gsw)    Rec (gsw)    F1 (gsw)    Avg. Prec    AUROC
S1         0.649         0.997        0.786       0.871        0.924
S2         0.673         0.997        0.804       0.911        0.946
S3         0.775         0.998        0.872       0.965        0.975

Table 7: Performance of settings S1, S2, and S3.

Figure 3: Official results announced by the organizers, displaying the teams' performance (ROC and PR curves).

Based on our initial analysis, we presume that the lower performance of the SAE on the test set is due to the very few samples of Twitter data available in the training data.

5 Conclusion

In this paper, we have shown the pertinence of an SAE with a Bayesian optimizer for the language detection task. The obtained results are encouraging, and the SAE was found effective for discriminating between very close languages or dialects. The proposed model can be extended by creating a host of features, such as character n-grams, word n-grams, word counts, etc., and then passing them through the autoencoder to choose the best features. In future work, we plan to (i) verify our model (SAE with BO) with other language detection datasets, and (ii) include more short texts, particularly Twitter data, in the training set and verify the performance of our model under a more balanced data type scenario.


Acknowledgments

This work was supported by an innovation project (under an InnoSuisse grant) aimed at improving automatic speech recognition and natural language understanding technologies for German, titled "SM2: Extracting Semantic Meaning from Spoken Material", funding application no. 29814.1 IP-ICT, and by the EU H2020 project "Real-time network, text, and speaker analytics for combating organized crime" (ROXANNE), grant agreement no. 833635. The second author, Esaú Villatoro-Tello, was supported partially by Idiap, UAM-C Mexico, and SNI-CONACyT Mexico during the elaboration of this work.

References

Ivana Balazevic, Mikio Braun, and Klaus-Robert Müller. 2016. Language detection for short text messages in social media. arXiv preprint arXiv:1608.08515.

James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.

Nora Hollenstein and Noëmi Aepli. 2015. A resource for natural language processing of Swiss German dialects.

Tommi Jauhiainen, Krister Lindén, and Heidi Jauhiainen. 2019a. Language model adaptation for language and dialect identification of text. Natural Language Engineering, 25(5):561–583.

Tommi Sakari Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, and Krister Lindén. 2019b. Automatic language identification in texts: A survey. Journal of Artificial Intelligence Research, 65:675–782.

Tom Kocmi and Ondřej Bojar. 2017. LanideNN: Multilingual language identification on character window. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 927–936.

Artur Kulmizev, Bo Blankers, Johannes Bjerva, Malvina Nissim, Gertjan van Noord, Barbara Plank, and Martijn Wieling. 2017. The power of character n-grams in native language identification. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 382–389.

Lei Le, Andrew Patterson, and Martha White. 2018. Supervised autoencoders: Improving generalization performance with unsupervised regularizers. In Advances in Neural Information Processing Systems, pages 107–117.

Lucy Linder, Michael Jungo, Jean Hennebert, Claudiu Musat, and Andreas Fischer. 2019. Automatic creation of text corpora for low-resource languages from the internet: The case of Swiss German.

Marco Lui, Jey Han Lau, and Timothy Baldwin. 2014. Automatic detection and language identification of multilingual documents. Transactions of the Association for Computational Linguistics, 2:27–40.

Ermelinda Oro, Massimo Ruffolo, and Mostafa Sheikhalishahi. 2018. Language identification of similar languages using recurrent neural networks. In ICAART.

Marc'Aurelio Ranzato and Martin Szummer. 2008. Semi-supervised learning of compact document representations with deep networks. In Proceedings of the 25th International Conference on Machine Learning, pages 792–799.

Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. 2015. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546–3554.
Fernando Sánchez-Vega, Esaú Villatoro-Tello, Manuel Montes-y-Gómez, Paolo Rosso, Efstathios Stamatatos, and Luis Villaseñor-Pineda. 2019. Paraphrase plagiarism identification with character-level features. Pattern Analysis and Applications, 22(2):669–681.

Yves Scherrer and Owen Rambow. 2010. Natural language processing for the Swiss German dialect area. In Semantic Approaches in Natural Language Processing: Proceedings of the Conference on Natural Language Processing 2010 (KONVENS), pages 93–102. Universaar.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959.

Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Prabhat, and Ryan Adams. 2015. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pages 2171–2180.

Liling Tan, Marcos Zampieri, Nikola Ljubešic, and Jörg Tiedemann. 2014a. Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection. In Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC), pages 11–15.

Liling Tan, Marcos Zampieri, Nikola Ljubešic, and Jörg Tiedemann. 2014b. Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection. In Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC), pages 11–15, Reykjavik, Iceland.

Zhihua Wei, Duoqian Miao, Jean-Hugues Chauchat, Rui Zhao, and Wen Li. 2009. N-grams based feature selection and text representation for Chinese text classification. International Journal of Computational Intelligence Systems, 2(4):365–374.

Jennifer Williams and Charlie Dagli. 2017. Twitter language identification of similar languages and dialects without ground truth. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 73–83.

Qiuyu Zhu and Ruixin Zhang. 2019. A classification supervised auto-encoder based on predefined evenly-distributed class centroids. arXiv preprint arXiv:1902.00220.