=Paper=
{{Paper
|id=Vol-2624/germeval-task2-paper4
|storemode=property
|title=Idiap Submission to Swiss-German Language Detection Shared Task
|pdfUrl=https://ceur-ws.org/Vol-2624/germeval-task2-paper4.pdf
|volume=Vol-2624
|authors=Shantipriya Parida,Esaú Villatoro-Tello,Sajit Kumar,Petr Motlicek,Qingran Zhan
|dblpUrl=https://dblp.org/rec/conf/swisstext/ParidaVKMZ20
}}
==Idiap Submission to Swiss-German Language Detection Shared Task==
Shantipriya Parida (1), Esaú Villatoro-Tello (2,1), Sajit Kumar (3), Petr Motlicek (1) and Qingran Zhan (1)

(1) Idiap Research Institute, Rue Marconi 19, 1920 Martigny, Switzerland. firstname.lastname@idiap.ch
(2) Universidad Autónoma Metropolitana, Unidad Cuajimalpa, Mexico City, Mexico. evillatoro@correo.cua.uam.mx
(3) Centre of Excellence in AI, Indian Institute of Technology, Kharagpur, West Bengal, India. kumar.sajit.sk@gmail.com
Abstract

Language detection is a key part of the NLP pipeline for text processing. Automatically detecting languages that belong to disjoint groups is relatively easy; it is considerably more challenging to detect languages that share a common origin or are dialects of one another. This paper describes Idiap's submission to the 2020 GermEval evaluation campaign (https://sites.google.com/view/gswid2020) on Swiss-German language detection. In this work, we feed high-dimensional features generated from the text data into a supervised autoencoder for detecting languages with dialectal variation. A Bayesian optimizer was used to fine-tune the hyper-parameters of the supervised autoencoder. To the best of our knowledge, we are the first to apply a supervised autoencoder to the language detection task.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The increased usage of smartphones, social media, and the internet has led to rapid growth in the generation of short texts. Identifying the language of a text is therefore a key component in building various NLP resources (Kocmi and Bojar, 2017). Language detection is the task of determining the language of a given text. Although the field has progressed substantially, several challenges remain: (1) distinguishing among similar languages, (2) detecting languages when content in multiple languages appears within a single document, and (3) identifying the language of very short texts (Balazevic et al., 2016; Lui et al., 2014; Williams and Dagli, 2017).

Discriminating between very close languages or dialects is difficult; German dialect identification and Indo-Aryan language identification are examples (Jauhiainen et al., 2019a). Although dialect identification is commonly based on the distributions of letters or letter n-grams, for some languages it may not be possible to distinguish related dialects with very similar phoneme and grapheme inventories (Scherrer and Rambow, 2010).

Many authors have proposed traditional machine learning approaches for language detection, such as Naive Bayes, SVMs, word and character n-grams, graph-based n-grams, prediction by partial matching (PPM), and linear interpolation with post-independent weight optimization and majority voting for combining multiple classifiers (Jauhiainen et al., 2019b).

More recently, deep learning techniques have shown strong performance on many NLP tasks, including language detection (Oro et al., 2018). In this context, many papers have demonstrated the capability of semi-supervised autoencoders to solve different tasks, indicating that autoencoders can learn a useful representation when trained on unlabeled data (Ranzato and Szummer, 2008; Rasmus et al., 2015). However, as per our literature survey, none of the recent research has applied autoencoders to the language detection task. In this paper, we propose a supervised configuration of the autoencoder, which utilizes labels for learning the representation. To the best of our knowledge, this is the first time this technique has been evaluated in the context of language detection.

1.1 Supervised Autoencoder

An autoencoder (AE) is a neural network that learns a representation (encoding) of the input data and then learns to reconstruct the original input from the learned representation. The autoencoder is mainly used for dimensionality reduction or feature extraction (Zhu and Zhang, 2019).
Normally, it is used in an unsupervised fashion: the network is leveraged for representation learning, and by learning to reconstruct the input, the AE extracts underlying abstract attributes that facilitate accurate prediction of the input.

A supervised autoencoder (SAE) is an autoencoder with an additional supervised loss on the representation layer. For a single hidden layer, the supervised loss is added at the output of the encoder; for a deeper autoencoder, it is added at the innermost (smallest) bottleneck layer, which is usually transferred to the supervised layer after training the autoencoder.

In supervised learning, the goal is to learn a function from a vector of inputs x ∈ R^d to a vector of targets y ∈ R^m. Consider an SAE with a single hidden layer of size k, where the weights of the first layer are F ∈ R^{k×d}. The function is trained on a finite batch of independent and identically distributed (i.i.d.) data, (x_1, y_1), ..., (x_t, y_t), with the goal of accurate prediction on new samples drawn from the same distribution. The output layer consists of weights W_p ∈ R^{m×k} to predict y and W_r ∈ R^{d×k} to reconstruct x. Let L_p be the supervised loss and L_r the reconstruction loss. In the case of regression, both losses might be represented by a squared error, resulting in the objective

\frac{1}{t} \sum_{i=1}^{t} \left[ L_p(W_p F x_i, y_i) + L_r(W_r F x_i, x_i) \right] = \frac{1}{2t} \sum_{i=1}^{t} \left[ \| W_p F x_i - y_i \|_2^2 + \| W_r F x_i - x_i \|_2^2 \right]. \quad (1)

The addition of the supervised loss to the autoencoder loss function acts as a regularizer and, as shown in Equation 1, leads to learning a better representation for the desired task (Le et al., 2018).
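To make the objective concrete, the following is a minimal PyTorch sketch of an SAE with a single hidden layer, as in Equation 1. This is an illustration, not the authors' implementation; the layer names are ours, and nn.Linear adds bias terms that Equation 1 omits.

```python
import torch.nn as nn

class SupervisedAutoencoder(nn.Module):
    """Single-hidden-layer SAE: encoder F, prediction head W_p, reconstruction head W_r (Eq. 1)."""

    def __init__(self, d_in: int, k_hidden: int, m_out: int):
        super().__init__()
        self.encoder = nn.Linear(d_in, k_hidden)      # F
        self.predict = nn.Linear(k_hidden, m_out)     # W_p
        self.reconstruct = nn.Linear(k_hidden, d_in)  # W_r

    def forward(self, x):
        h = self.encoder(x)
        return self.predict(h), self.reconstruct(h)

def sae_loss(y_hat, y, x_hat, x):
    # Supervised loss L_p plus reconstruction loss L_r; squared error for
    # both, as in Eq. 1 (a classifier would replace L_p with cross-entropy).
    return nn.functional.mse_loss(y_hat, y) + nn.functional.mse_loss(x_hat, x)
```

Minimizing sae_loss over a batch corresponds to the objective in Equation 1, with the reconstruction term acting as the regularizer described above.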
1.2 Bayesian Optimizer

In the case of the SAE, there are many hyperparameters related to (a) model construction and (b) optimization. Hence, SAE training without any hyperparameter tuning usually results in poor performance, owing to dependencies among the hyperparameters that can cause simultaneous over- and under-fitting.

Global optimization is the challenging problem of finding the globally best solution of (possibly nonlinear) models in the possible (or known) presence of multiple local optima. Bayesian optimization (BO) has been shown to outperform other state-of-the-art global optimization algorithms on several challenging benchmark functions (Snoek et al., 2012; Bergstra and Bengio, 2012). BO provides a principled technique, based on Bayes' theorem, for directing an efficient and effective search in a global optimization problem. It works by building a probabilistic model of the objective function, called the surrogate function, which is searched efficiently with an acquisition function to choose candidate samples for evaluation on the real objective function. It solves the minimization problem

x^* = \arg\min_{x \in \chi} f(x), \quad (2)

where χ is a compact subset of R^k (Snoek et al., 2015).

We therefore employed BO for hyperparameter optimization, where the objective is to find the hyperparameters of a given machine learning algorithm that yield the best performance as measured on a validation set.
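As an illustration of this loop, here is a sketch using scikit-optimize's gp_minimize over a search space mirroring Table 3. The train_and_validate stub and all names are our assumptions, not the paper's code; a real objective would train the SAE and return its validation error.

```python
from skopt import gp_minimize
from skopt.space import Categorical, Integer, Real

# Search space mirroring Table 3; 'sigma' presumably denotes a sigmoid activation.
space = [
    Integer(1, 5, name="n_layers"),
    Real(1e-5, 1e-2, prior="log-uniform", name="learning_rate"),
    Real(1e-6, 1e-3, prior="log-uniform", name="weight_decay"),
    Categorical(["relu", "sigma"], name="activation"),
]

def train_and_validate(n_layers, lr, wd, activation):
    # Stub standing in for SAE training; returns a toy error surface so the
    # example runs end-to-end. Replace with real training + dev-set error.
    return (lr - 1e-3) ** 2 + wd + 0.01 * n_layers

def objective(params):
    n_layers, lr, wd, activation = params
    return train_and_validate(n_layers, lr, wd, activation)  # BO minimizes this

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("best hyperparameters:", result.x)
```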
2 Proposed Method

The architecture of the proposed model is shown in Figure 1. We use character n-grams as features extracted from the input text. In comparison to word n-grams, which only capture the identity of a word and its possible neighbors, character n-grams additionally provide an excellent trade-off between sparseness and word identity, while combining several types of information: punctuation, the morphological makeup of a word, lexicon, and even context (Wei et al., 2009; Kulmizev et al., 2017; Sánchez-Vega et al., 2019). The extracted n-gram features are fed into a deep SAE with multiple hidden layers, as shown in Figure 1, and we use BO to select the optimal hyperparameters.
Figure 1: Proposed model architecture. The features extracted from the text are fed to the supervised autoencoder together with the targets "y"; the classification outputs are the language ids of the classified languages.
3 Experimental Setup and Datasets

The training dataset was provided by the organizers of the shared task. It consists of 2,000 tweets in the Swiss-German language (although 2K Twitter ids were provided, we were not able to retrieve them all, resulting in 1,976 training instances). Participants were allowed to use any additional resources as training data. Among the additional resources recommended by the organizers, the following Swiss-German datasets were suggested, and we used both in our experiments: NOAH (https://noe-eva.github.io/NOAH-Corpus/) (Hollenstein and Aepli, 2015) and SwissCrawl (https://icosys.ch/swisscrawl) (Linder et al., 2019).

The test data released by the organizers consists of 5,374 tweets (a mix of different languages) to be classified as Swiss-German versus not Swiss-German.

The training dataset provided by the organizers did not contain any non-Swiss-German text. In addition to the recommended Swiss-German datasets, we therefore used other, non-Swiss-German datasets for training our models: DSL (http://ttg.uni-saarland.de/resources/DSLCC/) (Tan et al., 2014a) and Ling10 (https://github.com/johnolafenwa/Ling10).

- DSL Dataset: The data obtained from the "Discriminating between Similar Languages (DSL) Shared Task 2015" contains 13 languages, as shown in Table 1. The DSL corpus collection has different versions based on different language groups, providing datasets for researchers to test their systems (Tan et al., 2014a). We selected DSLCC version 2.0 (https://github.com/Simdiva/DSL-Task/tree/master/data/DSLCC-v2.0) for our experiments (Tan et al., 2014b).

- Ling10 Dataset: The Ling10 dataset contains 190,000 sentences categorized into 10 languages (English, French, Portuguese, Mandarin Chinese, Russian, Hebrew, Polish, Japanese, Italian, Dutch) and is mainly used for language detection and for benchmarking NLP algorithms. We used "Ling10-train large" (one of the three variants of the Ling10 dataset) in our experiments.

Group Name                   Language               Id
South Eastern Slavic         Bulgarian              bg
                             Macedonian             mk
South Western Slavic         Bosnian                bs
                             Croatian               hr
                             Serbian                sr
West Slavic                  Czech                  cz
                             Slovak                 sk
Ibero-Romance (Spanish)      Peninsular Spanish     es-ES
                             Argentinian Spanish    es-AR
Ibero-Romance (Portuguese)   Brazilian Portuguese   pt-BR
                             European Portuguese    pt-PT
Austronesian                 Indonesian             id
                             Malay                  my

Table 1: DSL language groups. Similar languages with their language codes.

As the task is a binary classification of Swiss-German versus not Swiss-German, we split our entire collection of datasets, including the training set provided by the organizers, into two categories:

- Swiss-German (NOAH, SwissCrawl, Swiss-German training tweets);
- not Swiss-German (DSL, Ling10).

Accordingly, we labeled the target class of all Swiss-German text as "gsw" (Swiss-German) and the target class of all other text as "not gsw".
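A minimal sketch of this labeling step is shown below. The load_sentences helper and the corpus names passed to it are hypothetical stand-ins for per-corpus readers, not code from the paper.

```python
import pandas as pd

# load_sentences is a hypothetical helper returning a list of sentences
# from the named corpus; the binary labels follow the paper's scheme.
swiss = (load_sentences("noah") + load_sentences("swisscrawl")
         + load_sentences("swisstext_train"))
other = load_sentences("dsl") + load_sentences("ling10")

df = pd.DataFrame({
    "text": swiss + other,
    "label": ["gsw"] * len(swiss) + ["not gsw"] * len(other),
})
df = df.sample(frac=1.0, random_state=0)  # shuffle before the train/dev split
```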
We prepared three settings (S1, S2, and S3), combining the above datasets in different proportions of Swiss-German versus non-Swiss-German text for training the model. The dataset statistics for each setting are shown in Table 2.

We mixed the Swiss-German and other-language datasets and split them into training and development portions in different ratios according to the setting; in each setting, the training and development sets differ in the number of sentences selected from each dataset. We used the test set provided by the shared task organizers. Since the test set consists of Twitter text, during preprocessing we removed emojis and other unnecessary symbols.
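The paper does not specify the exact symbol set removed; a plausible sketch of such a cleaning step, with the emoji code-point ranges as our assumption, is:

```python
import re

# Common emoji and pictograph blocks; the exact set removed in the paper
# is not specified, so these ranges are an assumption.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def clean_tweet(text: str) -> str:
    text = EMOJI.sub(" ", text)                # drop emojis / pictographs
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(clean_tweet("grüezi 😀 mitenand ☀"))  # -> "grüezi mitenand"
```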
The ranges of values defining the hyperparameter search space are shown in Table 3; during training, BO chooses the best hyperparameters from these ranges. The overall configuration of the SAE model is shown in Table 4.
4 Results and Discussion

We report performance on the development set as evaluated by us, and on the test set as evaluated by the shared task organizers. The development set performance is given in Section 4.1 and the test set performance in Section 4.2.

Our evaluation computes classification accuracy by comparing the predicted label with the actual label. The organizers calculated precision, average precision, recall, and F1 score for each submission. Precision is the ratio of correctly predicted positive observations to all predicted positive observations; recall (or sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual positive class; and the F1 score is the harmonic mean of precision and recall.

The organizers also generated the Receiver Operating Characteristic (ROC) curve, the Area Under the ROC Curve (AUC), and Precision-Recall (PR) curves. The ROC curve measures performance at various classification thresholds, and the AUC represents the degree of separability, i.e., how well a trained model distinguishes between classes; the higher the AUC, the better the model. Finally, PR curves summarize the trade-off between the true positive rate (recall) and the positive predictive value (precision) at different probability thresholds; a good model is represented by a curve that bows towards (1,1).
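For reference, these metrics can be computed with scikit-learn as in the sketch below; the labels and scores shown are illustrative, not the shared task data.

```python
from sklearn.metrics import (average_precision_score,
                             precision_recall_fscore_support, roc_auc_score)

# Illustrative predictions: 1 = "gsw", 0 = "not gsw"; y_score is P(gsw).
y_true = [1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.3, 0.6, 0.7, 0.2]
y_pred = [int(s >= 0.5) for s in y_score]

prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"P={prec:.3f} R={rec:.3f} F1={f1:.3f}")
print("AUROC =", roc_auc_score(y_true, y_score))                      # ROC curve area
print("Avg. precision =", average_precision_score(y_true, y_score))  # PR curve summary
```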
4.1 Development Set

The SAE model performance for the three settings (S1, S2, and S3) on the development set is shown in Table 5, and the corresponding confusion matrices are shown in Figure 2.

Figure 2: Confusion matrices on the development (dev) set for settings S1, S2, and S3, showing the correct and incorrect predictions with counts broken down by class, i.e. "gsw" (Swiss-German) or "not gsw" (not Swiss-German).
Setting  Datasets and Language Distribution           Distribution (Overall)    Training   Dev      Test
S1       NOAH (Swiss-German) 7,327 (8%)               50% Swiss-German          80,000     20,000   5,374
         SwissCrawl (Swiss-German) 40,697 (40%)       50% not Swiss-German
         SwissTextTrain (Swiss-German) 1,976 (2%)
         DSL (not Swiss-German) 25,000 (25%)
         Ling10 (not Swiss-German) 25,000 (25%)
S2       NOAH (Swiss-German) 7,327 (5%)               61% Swiss-German          130,000    20,000   5,374
         SwissCrawl (Swiss-German) 81,841 (55%)       39% not Swiss-German
         SwissTextTrain (Swiss-German) 1,976 (1%)
         DSL (not Swiss-German) 25,000 (17%)
         Ling10 (not Swiss-German) 33,856 (22%)
S3       NOAH (Swiss-German) 7,327 (4%)               46% Swiss-German          180,000    20,000   5,374
         SwissCrawl (Swiss-German) 81,841 (41%)       54% not Swiss-German
         SwissTextTrain (Swiss-German) 1,976 (1%)
         DSL (not Swiss-German) 50,000 (25%)
         Ling10 (not Swiss-German) 58,856 (29%)

Table 2: Dataset statistics. The training-development-test distribution for each setting (S1, S2, and S3), based on the number of sentences selected from each dataset.
Hyperparameter          Range
number of layers        1-5
learning rate           10^-5 to 10^-2
weight decay            10^-6 to 10^-3
activation functions    'relu', 'sigma'

Table 3: Hyperparameter search space ranges.

Parameter               Value
char n-gram range       1-3
number of targets       2
embedding dimension     300
supervision             'clf' (classification)
convergence threshold   0.00001
number of epochs        500

Table 4: SAE model configuration used for training.

Team                    Precision   Recall   F1
IDIAP                   0.775       0.998    0.872
jj-cl-uzh               0.945       0.993    0.968
Mohammadreza Banaei     0.984       0.979    0.982

Table 6: Shared task results announced by the organizers, showing each participating team and its model performance (precision, recall, and F1).

Setting   Prec (gsw)   Rec (gsw)   F1 (gsw)   Avg. Prec   AUROC
S1        0.649        0.997       0.786      0.871       0.924
S2        0.673        0.997       0.804      0.911       0.946
S3        0.775        0.998       0.872      0.965       0.975

Table 7: Performance of settings S1, S2, and S3.
Model              Setting   Accuracy (%) on Development Set
SAE (char-3gram)   S1        100
SAE (char-3gram)   S2        100
SAE (char-3gram)   S3        100

Table 5: Swiss-German language detection performance (classification accuracy) of the proposed model on the development set for settings S1, S2, and S3.

4.2 Test Set

The overall results announced by the organizers on the test set are shown in Table 6 and Figure 3. Our submission, labeled "IDIAP", obtained 0.775 precision, 0.998 recall, and 0.872 F1 score with setting S3, as shown in Table 6. The detailed performance of each of our settings is shown in Table 7.

Based on our initial analysis, we presume that the lower performance of the SAE on the test set is due to the very few samples of Twitter data available in the training data.

5 Conclusion

In this paper, we have shown the pertinence of an SAE with a Bayesian optimizer for the language detection task. The obtained results are encouraging, and the SAE proved effective at discriminating between very close languages or dialects. The proposed model can be extended by creating a larger set of features, such as character n-grams, word n-grams, and word counts, and passing them through the autoencoder to choose the best features. In future work, we plan to (i) verify our model (SAE with BO) on other language detection datasets, and (ii) include more short texts, particularly Twitter data, in the training set and verify the performance of our model under a more balanced data scenario.
Figure 3: Official results announced by the organizers, showing each team's performance (ROC and PR curves).
Acknowledgments

The work was supported by an innovation project (under an InnoSuisse grant) aimed at improving automatic speech recognition and natural language understanding technologies for German ("SM2: Extracting Semantic Meaning from Spoken Material", funding application no. 29814.1 IP-ICT), and by the EU H2020 project "Real-time network, text, and speaker analytics for combating organized crime" (ROXANNE), grant agreement no. 833635. The second author, Esaú Villatoro-Tello, was partially supported by Idiap, UAM-C Mexico, and SNI-CONACyT Mexico during the elaboration of this work.

References

Ivana Balazevic, Mikio Braun, and Klaus-Robert Müller. 2016. Language detection for short text messages in social media. arXiv preprint arXiv:1608.08515.

James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.

Nora Hollenstein and Noëmi Aepli. 2015. A resource for natural language processing of Swiss German dialects.

Tommi Jauhiainen, Krister Lindén, and Heidi Jauhiainen. 2019a. Language model adaptation for language and dialect identification of text. Natural Language Engineering, 25(5):561–583.

Tommi Sakari Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, and Krister Lindén. 2019b. Automatic language identification in texts: A survey. Journal of Artificial Intelligence Research, 65:675–782.

Tom Kocmi and Ondřej Bojar. 2017. LanideNN: Multilingual language identification on character window. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 927–936.

Artur Kulmizev, Bo Blankers, Johannes Bjerva, Malvina Nissim, Gertjan van Noord, Barbara Plank, and Martijn Wieling. 2017. The power of character n-grams in native language identification. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 382–389.

Lei Le, Andrew Patterson, and Martha White. 2018. Supervised autoencoders: Improving generalization performance with unsupervised regularizers. In Advances in Neural Information Processing Systems, pages 107–117.

Lucy Linder, Michael Jungo, Jean Hennebert, Claudiu Musat, and Andreas Fischer. 2019. Automatic creation of text corpora for low-resource languages from the Internet: The case of Swiss German.

Marco Lui, Jey Han Lau, and Timothy Baldwin. 2014. Automatic detection and language identification of multilingual documents. Transactions of the Association for Computational Linguistics, 2:27–40.

Ermelinda Oro, Massimo Ruffolo, and Mostafa Sheikhalishahi. 2018. Language identification of similar languages using recurrent neural networks. In ICAART.

Marc'Aurelio Ranzato and Martin Szummer. 2008. Semi-supervised learning of compact document representations with deep networks. In Proceedings of the 25th International Conference on Machine Learning, pages 792–799.

Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. 2015. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546–3554.
Fernando Sánchez-Vega, Esaú Villatoro-Tello, Manuel Montes-y-Gómez, Paolo Rosso, Efstathios Stamatatos, and Luis Villaseñor-Pineda. 2019. Paraphrase plagiarism identification with character-level features. Pattern Analysis and Applications, 22(2):669–681.

Yves Scherrer and Owen Rambow. 2010. Natural language processing for the Swiss German dialect area. In Semantic Approaches in Natural Language Processing: Proceedings of the Conference on Natural Language Processing 2010 (KONVENS), pages 93–102. Universaar.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959.

Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. 2015. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pages 2171–2180.

Liling Tan, Marcos Zampieri, Nikola Ljubešić, and Jörg Tiedemann. 2014a. Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection. In Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC), pages 11–15.

Liling Tan, Marcos Zampieri, Nikola Ljubešić, and Jörg Tiedemann. 2014b. Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection. In Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC), pages 11–15, Reykjavik, Iceland.

Zhihua Wei, Duoqian Miao, Jean-Hugues Chauchat, Rui Zhao, and Wen Li. 2009. N-grams based feature selection and text representation for Chinese text classification. International Journal of Computational Intelligence Systems, 2(4):365–374.

Jennifer Williams and Charlie Dagli. 2017. Twitter language identification of similar languages and dialects without ground truth. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 73–83.

Qiuyu Zhu and Ruixin Zhang. 2019. A classification supervised auto-encoder based on predefined evenly-distributed class centroids. arXiv preprint arXiv:1902.00220.