<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Idiap Submission to Swiss-German Language Detection Shared Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shantipriya Parida</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Esaú Villatoro-Tello</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sajit Kumar</string-name>
          <email>kumar.sajit.sk@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petr Motlicek</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qingran Zhan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre of Excellence in AI, Indian Institute of Technology</institution>
          ,
          <addr-line>Kharagpur, West Bengal</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Idiap Research Institute</institution>
          ,
          <addr-line>Rue Marconi 19, 1920 Martigny</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad Autónoma Metropolitana, Unidad Cuajimalpa</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>1</volume>
      <fpage>927</fpage>
      <lpage>936</lpage>
      <abstract>
        <p>Language detection is a key part of the NLP pipeline for text processing. The task of automatically detecting languages belonging to disjoint groups is relatively easy, but it is considerably more challenging to detect languages that have similar origins or are dialects of one another. This paper describes Idiap's submission to the 2020 GermEval evaluation campaign on Swiss-German language detection. In this work, high-dimensional features generated from the text data are given as input to a supervised autoencoder for detecting languages with dialectal variation. A Bayesian optimizer was used to fine-tune the hyper-parameters of the supervised autoencoder. To the best of our knowledge, we are the first to apply a supervised autoencoder to the language detection task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The increased usage of smartphones, social media, and the internet has led to rapid growth in the generation of short linguistic texts. Identifying the language is therefore a key component in building various NLP resources (Kocmi and Bojar, 2017). Language detection is the task of determining the language of a given text. Although the field has progressed substantially, several challenges still exist: (1) distinguishing among similar languages, (2) detecting languages when content in multiple languages exists within a single document, and (3) identifying the language of very short texts
        <xref ref-type="bibr" rid="ref1 ref12">(Balazevic et al., 2016; Lui et al., 2014; Williams and Dagli, 2017)</xref>
        .
      </p>
      <p>
        It is difficult to discriminate between very close languages or dialects (for example, German dialect identification or Indo-Aryan language identification)
        <xref ref-type="bibr" rid="ref4">(Jauhiainen et al., 2019a)</xref>
        . Although dialect identification is commonly based on the distributions of letters or letter n-grams, for some languages it may not be possible to distinguish related dialects with very similar phoneme and grapheme inventories
        <xref ref-type="bibr" rid="ref6">(Scherrer and Rambow, 2010)</xref>
        .
      </p>
      <p>
        Many authors have proposed traditional machine learning approaches for language detection, such as Naive Bayes, SVMs, word and character n-grams, graph-based n-grams, prediction by partial matching (PPM), linear interpolation with post-independent weight optimization, and majority voting for combining multiple classifiers
        <xref ref-type="bibr" rid="ref5">(Jauhiainen et al., 2019b)</xref>
        .
      </p>
      <p>More recently, deep learning techniques have shown strong performance in many NLP tasks, including language detection <xref ref-type="bibr" rid="ref15">(Oro et al., 2018)</xref>. In this context, many papers have demonstrated the capability of semi-supervised autoencoders to solve different tasks, indicating that autoencoders allow learning a representation when trained with unlabeled data <xref ref-type="bibr" rid="ref16 ref17">(Ranzato and Szummer, 2008; Rasmus et al., 2015)</xref>. However, as per our literature survey, no recent research has applied autoencoders to the language detection task. In this paper, we propose a supervised configuration of the autoencoder, which utilizes labels for learning the representation. To the best of our knowledge, this is the first time this technique is evaluated in the context of the language detection task.</p>
      <sec id="sec-1-1">
        <title>1.1 Supervised Autoencoder</title>
        <p>
          An autoencoder (AE) is a neural network that learns a representation (encoding) of input data and then learns to reconstruct the original input from the learned representation. The autoencoder is mainly used for dimensionality reduction or feature extraction
          <xref ref-type="bibr" rid="ref13">(Zhu and Zhang, 2019)</xref>
          . Normally, it is used in an unsupervised fashion, i.e. the neural network is leveraged for the task of representation learning. By learning to reconstruct the input, the AE extracts underlying abstract attributes that facilitate accurate prediction of the input.
        </p>
        <p>A supervised autoencoder (SAE) is thus an autoencoder with an additional supervised loss on the representation layer. For a single hidden layer, the supervised loss is added at the output of that layer; for a deeper autoencoder, the supervised loss is added to the innermost (smallest) bottleneck layer, which is usually connected to the supervised output layer after training the autoencoder.</p>
        <p>In supervised learning, the goal is to learn a function from a vector of inputs $x \in \mathbb{R}^d$ to predict a vector of targets $y \in \mathbb{R}^m$. Consider an SAE with a single hidden layer of size $k$, where the weights of the first layer are $F \in \mathbb{R}^{k \times d}$. The function is trained on a finite batch of independent and identically distributed (i.i.d.) data, $(x_1, y_1), \ldots, (x_t, y_t)$, with the goal of accurate prediction on new samples generated from the same distribution. The output layer consists of weights $W_p \in \mathbb{R}^{m \times k}$ to predict $y$ and $W_r \in \mathbb{R}^{d \times k}$ to reconstruct $x$. Let $L_p$ be the supervised loss and $L_r$ the loss for the reconstruction error. In the case of regression, both losses might be represented by a squared error, resulting in the objective:

$$\frac{1}{t} \sum_{i=1}^{t} \left[ L_p(W_p F x_i, y_i) + L_r(W_r F x_i, x_i) \right] = \frac{1}{2t} \sum_{i=1}^{t} \left[ \lVert W_p F x_i - y_i \rVert_2^2 + \lVert W_r F x_i - x_i \rVert_2^2 \right] \qquad (1)$$</p>
        <p>The addition of the supervised loss to the autoencoder loss function acts as a regularizer and, as shown in equation 1, results in learning a better representation for the desired task (Le et al., 2018).</p>
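        <p>To make equation 1 concrete, below is a minimal sketch of an SAE in PyTorch; it is not the authors' released implementation, and the layer sizes, the ReLU nonlinearity, and the equal weighting of the two losses are illustrative assumptions.</p>
        <preformat>
# Minimal supervised autoencoder (SAE) sketch; sizes are placeholders.
import torch
import torch.nn as nn

class SAE(nn.Module):
    def __init__(self, d, k, m):
        super().__init__()
        self.encoder = nn.Linear(d, k)     # F in equation 1
        self.decoder = nn.Linear(k, d)     # W_r: reconstructs x
        self.classifier = nn.Linear(k, m)  # W_p: predicts y

    def forward(self, x):
        h = torch.relu(self.encoder(x))    # bottleneck representation
        return self.classifier(h), self.decoder(h)

model = SAE(d=5000, k=128, m=2)            # e.g. 5000 n-gram features, 2 classes
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

def train_step(x, y):
    optimizer.zero_grad()
    y_hat, x_hat = model(x)
    loss = ce(y_hat, y) + mse(x_hat, x)    # supervised loss + reconstruction loss
    loss.backward()
    optimizer.step()
    return loss.item()
        </preformat>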
      </sec>
      <sec id="sec-1-2">
        <title>1.2 Bayesian Optimizer</title>
        <p>In the case of an SAE, there are many hyperparameters, related to (a) model construction and (b) optimization. Hence, training an SAE without any hyperparameter tuning usually results in poor performance, due to dependencies among the hyperparameters that may result in simultaneous over- or under-fitting.</p>
        <p>
          Global optimization is the challenging problem of finding the globally best solution of (possibly nonlinear) models, in the possible or known presence of multiple local optima. Bayesian optimization (BO) has been shown to outperform other state-of-the-art global optimization algorithms on several challenging optimization benchmark functions
          <xref ref-type="bibr" rid="ref7 ref2">(Snoek et al., 2012; Bergstra and Bengio, 2012)</xref>
          . BO provides a principled technique, based on Bayes' theorem, to direct the search in a global optimization problem efficiently and effectively. It works by building a probabilistic model of the objective function, called the surrogate function, which is then searched efficiently with an acquisition function before candidate samples are chosen for evaluation on the real objective function. It tries to solve the minimization problem:
        </p>
        <p>
          $$x^{\star} = \arg\min_{x \in \mathcal{X}} f(x) \qquad (2)$$

where $\mathcal{X}$ is considered to be a compact subset of $\mathbb{R}^k$
          <xref ref-type="bibr" rid="ref8">(Snoek et al., 2015)</xref>
          .
        </p>
        </p>
        <p>Thus, we employed BO for hyperparameter optimization, where the objective is to find the hyperparameters of a given machine learning algorithm that yield the best performance as measured on a validation set.</p>
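        <p>As an illustration of how BO can tune SAE hyperparameters, here is a hedged sketch using scikit-optimize; the search ranges are placeholders rather than the values of Table 3, and build_sae and train_and_validate are hypothetical helpers.</p>
        <preformat>
# Bayesian optimization of SAE hyperparameters (sketch, scikit-optimize).
from skopt import gp_minimize
from skopt.space import Integer, Real

space = [Integer(32, 512, name="hidden_size"),
         Real(1e-5, 1e-2, prior="log-uniform", name="learning_rate"),
         Real(0.0, 0.5, name="dropout")]

def objective(params):
    hidden_size, learning_rate, dropout = params
    model = build_sae(hidden_size, dropout)              # hypothetical helper
    accuracy = train_and_validate(model, learning_rate)  # accuracy on the dev set
    return -accuracy                                     # BO minimizes the objective

result = gp_minimize(objective, space, n_calls=25, random_state=0)
print("best hyperparameters:", result.x)
        </preformat>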
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 Proposed Method</title>
      <p>
        The architecture of the proposed model is shown in Figure 1: character n-grams are extracted from the input text and used as features. In comparison to word n-grams, which only capture the identity of a word and its possible neighbors, character n-grams offer an excellent trade-off between sparseness and the word's identity, while at the same time combining different types of information: punctuation, the morphological makeup of a word, the lexicon, and even context
        <xref ref-type="bibr" rid="ref11 ref14">(Wei et al., 2009; Kulmizev et al., 2017; Sánchez-Vega et al., 2019)</xref>
        . The extracted n-gram features are input to the deep SAE, which contains multiple hidden layers, as shown in Figure 1. We used BO for selecting the optimal parameters.
      </p>
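      <p>For illustration, character n-gram features of the kind described above can be extracted as follows; the (1, 5) range and the word-boundary-aware analyzer are assumptions, not the exact configuration used in the paper.</p>
      <preformat>
# Character n-gram feature extraction (sketch, scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(1, 5))
X = vectorizer.fit_transform(["Grüezi mitenand!", "Hello everyone!"])
print(X.shape)  # (2, number of distinct n-grams): high-dimensional SAE input
      </preformat>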
    </sec>
    <sec id="sec-3">
      <title>3 Experimental Setup and Datasets</title>
      <p>
        The training dataset was provided by the organizers of the shared task. It consists of 2,000 tweets in the Swiss-German language (although 2,000 Twitter ids were provided, we were not able to retrieve them all, resulting in 1,976 training instances). The participants were allowed to use any additional resources as training datasets. As part of the additional resources recommended by the organizers, the following Swiss-German datasets were suggested: NOAH (https://noe-eva.github.io/NOAH-Corpus/)
        <xref ref-type="bibr" rid="ref3">(Hollenstein and Aepli, 2015)</xref>
        and SwissCrawl (https://icosys.ch/swisscrawl) (Linder et al., 2019), which we used in our experiments.
      </p>
      <p>The test data released by the organizers consists of 5,374 tweets (a mix of different languages) to be classified as Swiss-German versus not Swiss-German.</p>
      <p>
        The training dataset provided by the organizers did not contain any non-Swiss-German text. Therefore, in addition to the recommended Swiss-German datasets, we used other, non-Swiss-German datasets (DSL, available at http://ttg.uni-saarland.de/resources/DSLCC/
        <xref ref-type="bibr" rid="ref9">(Tan et al., 2014a)</xref>
        , and Ling10, available at https://github.com/johnolafenwa/Ling10) for training our models.
      </p>
      <p>
        DSL Dataset: The data obtained from the "Discriminating between Similar Languages (DSL) Shared Task 2015" contains 13 different languages, as shown in Table 1. The DSL corpus collection has different versions based on different language groups, providing datasets for researchers to test their systems
        <xref ref-type="bibr" rid="ref9">(Tan et al., 2014a)</xref>
        . We selected DSLCC version 2.0 (https://github.com/Simdiva/DSL-Task/tree/master/data/DSLCC-v2.0) in our experiments
        <xref ref-type="bibr" rid="ref10">(Tan et al., 2014b)</xref>
        .
      </p>
      <p>Ling10 Dataset: The Ling10 dataset contains 190,000 sentences categorized into 10 languages (English, French, Portuguese, Chinese Mandarin, Russian, Hebrew, Polish, Japanese, Italian, Dutch), mainly used for language detection and for benchmarking NLP algorithms. We considered "Ling10-trainlarge" (one of the three variants of the Ling10 dataset) in our experiment.</p>
      <p>Table 1: The 13 languages of the DSL corpus collection.
Group Name | Id | Language
South Eastern Slavic | bg | Bulgarian
South Eastern Slavic | mk | Macedonian
South Western Slavic | bs | Bosnian
South Western Slavic | hr | Croatian
South Western Slavic | sr | Serbian
West-Slavic | cz | Czech
West-Slavic | sk | Slovak
IberoRomance (Spanish) | es-ES | Peninsular Spanish
IberoRomance (Spanish) | es-AR | Argentinian Spanish
IberoRomance (Portuguese) | pt-BR | Brazilian Portuguese
IberoRomance (Portuguese) | pt-PT | European Portuguese
Austronesian | id | Indonesian
Austronesian | my | Malay</p>
      <p>As the task is a binary classification of Swiss-German versus not Swiss-German, we split our whole collection of datasets, including the training set provided by the organizers, into two categories as follows:</p>
      <p>Swiss-German (NOAH, SwissCrawl, Swiss-German training tweets).</p>
      <p>not Swiss-German (DSL, Ling10).</p>
      <p>Accordingly, we labeled the target class of all the Swiss-German text as "gsw" (Swiss-German) and the target class of all other language text as "not gsw".</p>
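      <p>A minimal sketch of this relabeling is given below; the file names are hypothetical placeholders for the prepared datasets.</p>
      <preformat>
# Binary relabeling of the combined corpora (sketch; paths are placeholders).
def load_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

gsw = (load_lines("noah.txt") + load_lines("swisscrawl.txt")
       + load_lines("gsw_tweets.txt"))                    # Swiss-German sources
other = load_lines("dsl.txt") + load_lines("ling10.txt")  # all other languages

texts = gsw + other
labels = ["gsw"] * len(gsw) + ["not gsw"] * len(other)    # binary target classes
      </preformat>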
      <p>We prepared three settings (S1, S2, and S3)
combining the above datasets in different proportions
of Swiss-German versus not Swiss-German
languages for training the model. The statistics of the
datasets for the settings are shown in Table 2.</p>
      <p>We mixed the Swiss-German and other-language datasets and split them into different ratios for training and development as per the settings. In each setting, the training and development sets differ in the number of sentences selected from each dataset. We used the test set provided by the shared task organizers. As the test set includes Twitter text, during preprocessing we removed emojis and other unnecessary symbols.</p>
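      <p>One possible form of this cleaning step is sketched below; the exact symbol ranges removed in our pipeline are not specified here, so the regular expression is an assumption.</p>
      <preformat>
# Tweet preprocessing sketch: strip emojis and collapse whitespace.
import re

def clean_tweet(text):
    # Remove common emoji and symbol blocks (illustrative ranges only).
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Grüezi 😀🎉 mitenand!"))  # prints "Grüezi mitenand!"
      </preformat>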
      <p>The range of values for the hyperparameter search space is shown in Table 3. During training, BO chooses the best hyperparameters from this range. The overall configuration of the SAE model is shown in Table 4.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Results and Discussion</title>
      <p>We report the development set performance as well as the test set evaluation performed by the shared task organizers. The development set performance is given in Section 4.1 and the test set performance in Section 4.2.</p>
      <p>Our evaluation consists of calculating classification accuracy by comparing the predicted label with the actual label. The organizers calculated precision, average precision, recall, and F1 score for each of the submissions. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations; recall (or sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual positive class; and the F1 score is the harmonic mean of precision and recall.</p>
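      <p>These metrics can be reproduced with scikit-learn as in the following sketch; the label vectors are toy examples, not our actual predictions.</p>
      <preformat>
# Computing accuracy, precision, recall, and F1 (sketch, scikit-learn).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["gsw", "gsw", "not gsw", "gsw", "not gsw"]
y_pred = ["gsw", "not gsw", "not gsw", "gsw", "gsw"]

prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label="gsw", average="binary")
print(accuracy_score(y_true, y_pred), prec, rec, f1)
      </preformat>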
      <p>The organizers also generated the Receiver Operating Characteristic (ROC) curve, the Area Under the ROC Curve (AUC), and Precision-Recall (PR) curves. The AUC-ROC curve is a performance measurement at various threshold settings: the ROC is a probability curve, and the AUC represents the degree or measure of separability, i.e. how well a trained model is capable of distinguishing between classes; the higher the AUC, the better the model performance. Finally, PR curves summarize the trade-off between the true positive rate (recall) and the positive predictive value (precision) for a predictive model using different probability thresholds; hence, a good model is represented by a curve that bows towards (1,1).</p>
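      <p>The ROC/AUC and PR computations described above follow standard practice; a sketch with scikit-learn on toy scores is shown below (the scores are illustrative, not model outputs).</p>
      <preformat>
# ROC, AUC, and PR curve computation (sketch, scikit-learn).
from sklearn.metrics import auc, precision_recall_curve, roc_curve

y_true = [1, 1, 0, 1, 0, 0]               # 1 = "gsw", 0 = "not gsw"
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.6]   # predicted probability of "gsw"

fpr, tpr, _ = roc_curve(y_true, scores)
print("AUC:", auc(fpr, tpr))              # higher AUC means better separability

precision, recall, _ = precision_recall_curve(y_true, scores)
# A good model's PR curve bows towards the (1, 1) corner.
      </preformat>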
      <p>The SAE model performance for the three settings (S1, S2, and S3) on the development set is shown in Table 5, and the corresponding confusion matrices are shown in Figure 2 (Figure 2: confusion matrices for settings S1, S2, and S3 on the dev set). A confusion matrix shows the correct and incorrect predictions, with count values broken down by each class, i.e. "gsw" (Swiss-German) or "not gsw" (not Swiss-German). Our submitted system (setting S3) obtained the precision (prec), recall (rec), and F1 score shown in Table 6. The detailed performance of each of our settings is shown in Table 7.</p>
      <p>Table 6 (excerpt): Precision of the submitted systems.
Team | Precision
IDIAP | 0.775
jj-cl-uzh | 0.945
Mohammadreza Banaei | 0.984</p>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion</title>
      <p>In this paper, we have shown the pertinence of an SAE with a Bayesian optimizer for the language detection task. The obtained results are encouraging, and the SAE was found to be effective for discriminating between very close languages or dialects. The proposed model can be extended by creating a host of features, such as character n-grams, word n-grams, and word counts, and then passing them through the autoencoder to choose the best features. In future work, we plan to (i) verify our model (SAE with BO) on other language detection datasets, and (ii) include more short texts, particularly Twitter data, in the training set and verify the performance of our model under a more balanced data-type scenario.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The work was supported by an innovation project (under an InnoSuisse grant) oriented to improve the automatic speech recognition and natural language understanding technologies for German (title: "SM2: Extracting Semantic Meaning from Spoken Material", funding application no. 29814.1 IP-ICT), and by the EU H2020 project "Real-time network, text, and speaker analytics for combating organized crime" (ROXANNE), grant agreement 833635. The second author, Esaú Villatoro-Tello, was partially supported by Idiap, UAM-C Mexico, and SNI-CONACyT Mexico during the elaboration of this work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Ivana</given-names>
            <surname>Balazevic</surname>
          </string-name>
          , Mikio Braun, and
          <string-name>
            <given-names>Klaus-Robert</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Language detection for short text messages in social media</article-title>
          .
          <source>arXiv preprint arXiv:1608.08515</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>James</given-names>
            <surname>Bergstra</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Random search for hyper-parameter optimization</article-title>
          .
          <source>Journal of machine learning research</source>
          ,
          <volume>13</volume>
          (Feb):
          <fpage>281</fpage>
          -
          <lpage>305</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Nora</given-names>
            <surname>Hollenstein</surname>
          </string-name>
          and Noëmi Aepli.
          <year>2015</year>
          .
          <article-title>A resource for natural language processing of Swiss German dialects</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Tommi</given-names>
            <surname>Jauhiainen</surname>
          </string-name>
          , Krister Lindén, and Heidi Jauhiainen.
          <year>2019a</year>
          .
          <article-title>Language model adaptation for language and dialect identification of text</article-title>
          .
          <source>Natural Language Engineering</source>
          ,
          <volume>25</volume>
          (
          <issue>5</issue>
          ):
          <fpage>561</fpage>
          -
          <lpage>583</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Tommi Sakari</given-names>
            <surname>Jauhiainen</surname>
          </string-name>
          , Marco Lui, Marcos Zampieri, Timothy Baldwin, and Krister Lindén. 2019b.
          <article-title>Automatic language identification in texts: A survey</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          ,
          <volume>65</volume>
          :
          <fpage>675</fpage>
          -
          <lpage>782</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Ermelinda</given-names>
            <surname>Oro</surname>
          </string-name>
          , Massimo Ruffolo, and Mostafa Sheikhalishahi.
          <year>2018</year>
          .
          <article-title>Language identification of similar languages using recurrent neural networks</article-title>
          .
          <source>In ICAART</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Marc'Aurelio</given-names>
            <surname>Ranzato</surname>
          </string-name>
          and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Szummer</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Semi-supervised learning of compact document representations with deep networks</article-title>
          .
          <source>In Proceedings of the 25th International Conference on Machine Learning</source>
          , pages
          <fpage>792</fpage>
          -
          <lpage>799</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Antti</given-names>
            <surname>Rasmus</surname>
          </string-name>
          , Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko.
          <year>2015</year>
          .
          <article-title>Semi-supervised learning with ladder networks</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>3546</fpage>
          -
          <lpage>3554</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Fernando</given-names>
            <surname>Sánchez-Vega</surname>
          </string-name>
          , Esaú Villatoro-Tello, Manuel Montes-y-Gómez, Paolo Rosso, Efstathios Stamatatos, and Luis Villaseñor-Pineda.
          <year>2019</year>
          .
          <article-title>Paraphrase plagiarism identification with character-level features</article-title>
          .
          <source>Pattern Analysis and Applications</source>
          ,
          <volume>22</volume>
          (
          <issue>2</issue>
          ):
          <fpage>669</fpage>
          -
          <lpage>681</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Yves</given-names>
            <surname>Scherrer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Owen</given-names>
            <surname>Rambow</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Natural language processing for the Swiss German dialect area</article-title>
          .
          <source>In Semantic Approaches in Natural Language Processing-Proceedings of the Conference on Natural Language Processing 2010 (KONVENS)</source>
          , pages
          <fpage>93</fpage>
          -
          <lpage>102</lpage>
          . Universaar.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Jasper</given-names>
            <surname>Snoek</surname>
          </string-name>
          , Hugo Larochelle, and Ryan P Adams.
          <year>2012</year>
          .
          <article-title>Practical Bayesian optimization of machine learning algorithms</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>2951</fpage>
          -
          <lpage>2959</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Jasper</given-names>
            <surname>Snoek</surname>
          </string-name>
          , Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Prabhat, and
          <string-name>
            <given-names>Ryan P</given-names>
            <surname>Adams</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Scalable Bayesian optimization using deep neural networks</article-title>
          .
          <source>In International conference on machine learning</source>
          , pages
          <fpage>2171</fpage>
          -
          <lpage>2180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Liling</given-names>
            <surname>Tan</surname>
          </string-name>
          , Marcos Zampieri, Nikola Ljubešić, and Jörg Tiedemann. 2014a.
          <article-title>Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection</article-title>
          .
          <source>In Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC)</source>
          , pages
          <fpage>11</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Liling</given-names>
            <surname>Tan</surname>
          </string-name>
          , Marcos Zampieri, Nikola Ljubešić, and Jörg Tiedemann. 2014b.
          <article-title>Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection</article-title>
          .
          <source>In Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC)</source>
          , pages
          <fpage>11</fpage>
          -
          <lpage>15</lpage>
          , Reykjavik, Iceland.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Zhihua</given-names>
            <surname>Wei</surname>
          </string-name>
          , Duoqian Miao,
          <string-name>
            <given-names>Jean-Hugues</given-names>
            <surname>Chauchat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Rui</given-names>
            <surname>Zhao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Wen</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>N-grams based feature selection and text representation for Chinese text classification</article-title>
          .
          <source>International Journal of Computational Intelligence Systems</source>
          ,
          <volume>2</volume>
          (
          <issue>4</issue>
          ):
          <fpage>365</fpage>
          -
          <lpage>374</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Jennifer</given-names>
            <surname>Williams</surname>
          </string-name>
          and
          <string-name>
            <given-names>Charlie</given-names>
            <surname>Dagli</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Twitter language identification of similar languages and dialects without ground truth</article-title>
          .
          <source>In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</source>
          , pages
          <fpage>73</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Qiuyu</given-names>
            <surname>Zhu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ruixin</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A classification supervised auto-encoder based on predefined evenly-distributed class centroids</article-title>
          .
          <source>arXiv preprint arXiv:1902.00220</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>