                                Word-level Language Identification in Code-mixed
                                Tulu Texts
                                Sushma N, Asha Hegde and Hosahalli Lakshmaiah Shashirekha
                                Department of Computer Science, Mangalore University, Mangalore, Karnataka, India


                                                                      Abstract
                                                                      Word-level Language Identification (LI) is the task of identifying the language of every word within a
                                                                      given multilingual sentence as in the case of code-mixed text. It is an essential pre-processing step for
various language dependent applications such as machine translation. Though several research works are
available for word-level LI in high-resource languages like Spanish and French in a multilingual context,
many under-resourced languages are not yet explored in this direction. The "CoLI-Tunglish: Word-level
Language Identification in Code-mixed Tulu Texts" shared task organized at the Forum for Information
Retrieval Evaluation (FIRE) 2023 invites researchers to develop models to address the challenges of
word-level LI in Tulu - an under-resourced Dravidian language. In this paper, we - team MUCS - describe
                                                                      the learning models submitted to this shared task for word-level LI in Tulu. Two distinct models: CoLI-
                                                                      Ensemble - an ensemble of Machine Learning (ML) classifiers (Support Vector Machine (SVM), Random
                                                                      Forest (RF), and Logistic Regression (LR)) with hard voting trained using character n-grams in the range
                                                                      (1, 3) and fastText pre-trained word vectors individually, and CoLI-CRF - a Conditional Random Field
                                                                      (CRF) algorithm trained with text-based features, are proposed for word-level LI in code-mixed Tulu text.
Among the proposed models, CoLI-CRF outperformed the other model with a macro F1-score of 0.77,
securing 4th rank in the shared task.

                                                                      Keywords
                                                                      Language identification, Tulu, Sequence labeling, Machine learning, Word embeddings




                                1. Introduction
In a multilingual country like India, people are proficient in more than one language and often
express themselves using a combination of two or more languages on social media platforms
like Twitter, Instagram, Facebook, etc. [1, 2]. This mixture of languages, known as Code-mixing,
involves the mixing of words or sub-words of more than one language at the word, phrase,
or sentence level, in either a single script or multiple scripts [3]. Despite the availability
of various applications that enable entering data in local/native languages, users frequently
opt for the Roman script, owing to the technical limitations of computer keyboards and smartphone
keypads for keying in Indian language characters and the ease of conveying information in the
Roman script [4, 5]. This has made Code-mixing a common phenomenon, especially on social media
platforms.
   Processing code-mixed text is challenging as it needs tools/models that can handle
multiple languages and multiple scripts in a given text [6]. The majority of the available
                                Forum for Information Retrieval Evaluation, December 15-18, 2023, India
sush.prgm@gmail.com (S. N); hegdekasha@gmail.com (A. Hegde); hlsrekha@mangaloreuniversity.ac.in
(H. L. Shashirekha)
                                                                    © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
computational tools and pre-trained models, however, can only support monolingual text,
highlighting the demand for effective tools and models to handle code-mixed text. Further, lack
of digital resources for code-mixed text adds another dimension to the challenges associated
with processing code-mixed text.
   Tulu is an under-resourced language that belongs to the Dravidian language family and is
spoken by more than three million people in the coastal regions of Karnataka and along the
Karnataka-Kerala border. People whose mother tongue is Tulu are known as Tuluvas, and they
are also found in Mumbai, Maharashtra, and many Gulf countries. The Tulu language contains
several Kannada words, and as the Tulu script is not popular, people commonly use the Kannada
script to write Tulu text. Further, people, specifically those who are active on social media
platforms, use either the Kannada or Roman script or a combination of both to post their
comments/reviews, resulting in code-mixed text.
   Models for Natural Language Processing (NLP) tasks like Machine Translation (MT) [7],
Transliteration [8], Parts-Of-Speech (POS) tagging [9], Named Entity Recognition [10], etc., are
conventionally designed for monolingual text. Using such models directly for code-mixed text
may result in degraded performance, due to the diverse linguistic structures of code-mixed
text. This emphasizes the importance of language identification to ensure the quality of the
applications/algorithms for processing code-mixed text, which is multilingual in nature.
   The preliminary step in processing code-mixed text is to identify the language of each word
in a given sentence [11]. Word-level LI in high-resource languages like French and Spanish, and
under-resourced languages like Hindi, Bengali, Tamil, Telugu, Kannada, and Malayalam, have
been explored by many researchers [12, 13, 14, 15]. However, Tulu has never been explored in
this direction, due to the non-availability of datasets and computational tools for this language.
To address the challenges of word-level LI in Tulu, in this paper, we - team MUCS - describe
the learning models submitted to the "CoLI-Tunglish: Word-level Language Identification in Code-
mixed Tulu Texts" shared task organized at FIRE 2023. The aim of this shared task is to develop
learning models to tag one of the seven classes, viz., Tulu, English, Kannada, Mixed, Name, Location,
and Other, to each word in a given sentence. This shared task is modeled as a sequence labeling
problem with two distinct models: i) CoLI-Ensemble - an ensemble of ML classifiers (SVM,
RF, and LR) trained separately with Term Frequency-Inverse Document Frequency (TF-IDF) of
character n-grams in the range (1, 3) and fastText word embeddings, and ii) CoLI-CRF - a CRF
classifier trained with text-based features, to identify the language of each word.
   The rest of the paper is organized as follows: Section 2 contains Related Work and Section
3 describes the Methodology. While Section 4 gives the description of the Experiments and
Results, the paper concludes with future work in Section 5.


2. Related work
Code-mixed text has become the default mode of communication on social media in the Indian
context, and word-level LI in such text has attracted considerable research interest, with several
notable works [16]. The following provides an overview of a few word-level LI works relevant
to this study:
   Shashirekha et al. [15] created a dataset for word-level LI in code-mixed Kannada text with
19,432 unique words and also collected code-mixed Kannada text with 72,815 unique sentences
to build pre-trained models. The authors implemented four distinct models: i) CoLI-ngrams - an
ensemble of three ML classifiers (LR, Linear Support Vector Classifier (Linear SVC), and Multilayer
Perceptron (MLP)) with soft voting, trained with count vectors of character n-grams obtained
from sub-word tokens, ii) CoLI-vectors - pre-trained embeddings created considering words,
sub-words, and characters from code-mixed Kannada text and used to train both ML and Deep
Learning (DL) models, iii) CoLI-BiLSTM - a DL model trained with CoLI-vectors, and iv) CoLI-
ULMFiT - a Universal Language Model Fine-Tuning (ULMFiT) model pre-trained on raw text
and fine-tuned with the Train set, for word-level LI in code-mixed Kannada text. Among
these models, the CoLI-ngrams model obtained a macro F1-score of 0.64. A code-mixed Telugu-
English dataset with 29,503 tokens was created by Gundapu and Mamidi [14] for word-level LI. To
benchmark their dataset, the authors trained Naïve Bayes (NB) and RF classifiers with TF-IDF
of character sequences, and Hidden Markov Model (HMM) and CRF models with text-based
features. Among these models, the CRF model obtained a macro F1-score of 0.91.
   Thara and Poornachandran [13] created an annotated corpus of 7,75,430 tokens for word-level
LI in code-mixed Malayalam-English text and implemented a wide range of transformer-based
models (Bidirectional Encoder Representations from Transformers (BERT), a distilled version
of BERT (DistilBERT), Efficiently Learning an Encoder that Classifies Token Replacements
Accurately (ELECTRA), Cross-lingual Language Model based on the Robustly Optimized BERT
Pretraining Approach (XLM-RoBERTa), and CamemBERT). Among their proposed models, the
ELECTRA model obtained a macro F1-score of 0.9933. Mandal and Singh [17] proposed a novel
approach for word-level LI in code-mixed Bangla
and Hindi texts. Their proposed methodology has two phases: i) implementing Multichannel
Neural Networks (MNN) by combining Convolutional Neural Networks (CNN) and Long Short-
Term Memory (LSTM) and ii) feeding the output of MNN to Bidirectional LSTM+CRF model.
With this approach, they obtained macro F1-scores of 93.49 and 93.32 for code-mixed Bangla
and Hindi texts respectively.
   Veena et al. [12] developed word embeddings as a function of the character embeddings of
the characters present in the word, for word-level LI in code-mixed Tamil and Malayalam texts.
They also used word embeddings of word trigrams and 5-grams as context features for each
word. By training two individual SVM models for each context feature and word embeddings,
the SVM model trained with 5-grams context features achieved macro F1-scores of 91.52 and
94.77 for code-mixed Malayalam and Tamil texts respectively. Barman et al. [18] created a
trilingual (Bengali, English and Hindi) code-mixed dataset with 26,475 tokens for word-level
LI. To benchmark their dataset, the authors trained SVM with TF-IDF of character n-grams
in the range n = (1, 5) and CRF model with text-based features. Their proposed CRF model
outperformed the other model with an accuracy of 95.76%.
   From the available literature, it is clear that researchers have explored character n-grams,
character embeddings, and BERT models to train conventional ML models and Neural Network
(NN) models for word-level LI in different Dravidian languages. To the best of our knowledge,
word-level LI in code-mixed Tulu text, an under-resourced Dravidian language, has not been
explored so far. This gives ample scope to explore various algorithms for word-level LI in
code-mixed Tulu text.
3. Methodology
The proposed methodology for word-level LI in Tulu code-mixed texts consists of two models:
i) CoLI-Ensemble and ii) CoLI-CRF. Pre-processing the dataset is not required as the dataset
provided by the shared task organizers is clean and ready to use. The proposed models are
described below:




Figure 1: Framework of the proposed methodology



3.1. CoLI-Ensemble
This model consists of feature extraction followed by classifier construction. A description of
each of these steps is given below:

3.1.1. Feature Extraction
Features play a significant role in deciding the performance of a classifier, and the aim of feature
extraction is to extract distinguishable features that can be used to train the learning models.
The CoLI-Ensemble model makes use of the following features (illustrative sketches of both
feature types follow this list):

    • Character n-grams - a character n-gram is a sequence of 'n' characters in a romanized word.
      As character n-grams capture the structure of words, they can be conveniently used to represent
      the words in any romanized code-mixed text. In this work, character n-grams in the range (1, 3)
      are extracted and vectorized using TfidfVectorizer1 to get the TF-IDF representation.
    • Pre-trained word embeddings - are vector representations of words computed using
      vast amounts of text data in a language. These embeddings are language dependent
      and encapsulate both the meaning and structure of words, enabling them to encode
      semantic and syntactic nuances and relationships between words. The only pre-trained
      models available for Tulu are fastText embeddings2 and Byte-Pair Encoding (BPEmb)3
   1
     https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
   2
     https://fasttext.cc/docs/en/pretrained-vectors.html
   3
     https://bpemb.h-its.org/tcy/
      and the vocabularies of both these pre-trained models are very small (Tulu fastText - 7,000
      and BPEmb - 10,000). In this work, Tulu, Kannada (vocabulary size - 1,88,249), and
      English (vocabulary size - 20,00,000) fastText pre-trained word embeddings of size 300 are
      used. The strength of the fastText pre-trained model is its capability to handle sub-word
      information, making it particularly well-suited for languages with rich morphological
      structures. This strength arises from its use of character n-grams, which enables it to
      represent and understand words even when they share sub-word components with other words.
      Transliteration is the process of converting text from one script or writing system to another,
      preserving the pronunciation or sound of the original text rather than its meaning. As
      the pre-trained models are language dependent and the given dataset is in Roman script,
      all the words are transliterated to Kannada script (it may be noted that Tulu is written in
      the Kannada script) using the Libindic4 library. With this arrangement, the words in the given
      dataset are available in both Kannada and Roman scripts. The following procedure is
      used to extract the word embeddings:
         – If the word (either in Kannada or Roman script) is present in the vocabulary of
           any one of the pre-trained models (Kannada, Tulu, or English), the embedding of
           the word is extracted from the corresponding pre-trained model.
         – If the word is present in the vocabulary of more than one pre-trained model, then the
           embedding for that word is taken from the pre-trained model of the language
           to which it belongs in the dataset (i.e., the tag of that word). As many Kannada words
           are used in the Tulu language, there is a chance that some words may be present in both
           the Kannada and Tulu vocabularies of the pre-trained models. Similarly, many English
           words may be present in the Kannada/Tulu vocabularies of the pre-trained models.
         – If the word is not present in the vocabulary of any of the above three pre-trained
           models, it is considered an Out-Of-Vocabulary (OOV) word. The embedding of such
           a word is created as an aggregation of the character embeddings of the characters
           present in the word in Roman script, and English fastText embeddings are used for
           this purpose.
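
As an illustration of the first feature type above, the following minimal sketch shows how TF-IDF
vectors of character n-grams in the range (1, 3) can be extracted with scikit-learn; the example
words are drawn from Table 4 and stand in for the dataset tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy word list standing in for the tokens of the CoLI-Tunglish dataset.
words = ["apundu", "super", "kudla", "photoga"]

# Character n-grams in the range (1, 3), vectorized as TF-IDF,
# as described for the CoLI-Ensemble model.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
X = vectorizer.fit_transform(words)  # sparse matrix: words x character n-grams
print(X.shape, len(vectorizer.get_feature_names_out()))
```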

The feature vectors which are obtained from the above feature extraction methods are then
used to train the ensemble of ML classifiers.
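
The embedding lookup procedure described above can be sketched as follows. This is a minimal
sketch, assuming the three fastText .vec files are loaded with gensim, that Roman-to-Kannada
transliteration has already been performed, and that "aggregation" of character embeddings is
taken to be their mean; the file names and the helper function are illustrative, not the exact
implementation.

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical paths to the Tulu, Kannada, and English fastText .vec files.
models = {
    "Tulu": KeyedVectors.load_word2vec_format("wiki.tcy.vec"),
    "Kannada": KeyedVectors.load_word2vec_format("cc.kn.300.vec"),
    "English": KeyedVectors.load_word2vec_format("cc.en.300.vec"),
}

# Roman -> Kannada transliteration (LibIndic) is assumed to have been applied
# beforehand, so each word is available in both scripts.

def word_embedding(roman, kannada, tag):
    """Return a 300-dim vector for a word given in Roman and Kannada scripts;
    'tag' is the word's language label in the dataset, used to break ties
    when the word is found in more than one pre-trained vocabulary."""
    hits = {lang: kv for lang, kv in models.items()
            if roman in kv or kannada in kv}
    if hits:
        # Prefer the pre-trained model of the word's own tag on multiple hits.
        kv = hits.get(tag, next(iter(hits.values())))
        return kv[roman] if roman in kv else kv[kannada]
    # OOV word: aggregate (here: average) English character embeddings.
    en = models["English"]
    chars = [en[c] for c in roman if c in en]
    return np.mean(chars, axis=0) if chars else np.zeros(en.vector_size)
```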

3.1.2. Classifier Construction
An ensemble model is a method of generating a new classifier from multiple diverse base classifiers,
taking advantage of the strength of one classifier to overcome the weakness of another with the
intention of getting better performance for the classification task [19]. Such an arrangement of
diverse classifiers often outperforms the constituent classifiers considered individually. In
ensemble models, several classifiers work together by voting to predict the class label of a
sample. The proposed CoLI-Ensemble model ensembles three ML classifiers (SVM, LR, and RF)
with hard voting. Descriptions of the classifiers used in this model are given below:

    4
        https://github.com/libindic/indic-trans
Table 1
Hyperparameters and their values used in CoLI-Ensemble model
                                Model Name          Hyperparameter and values
                                      SVM               class_weight=’balanced’
                                                            n_estimators=100,
                                       RF                   max_depth=None,
                                                               n_jobs=-1
                                       LR                           -


    • SVM - is an ML classifier primarily designed for binary classification tasks. To apply
      SVM to multiclass classification, common strategies like One-vs-Rest and One-vs-One
      are employed. One-vs-Rest trains multiple binary classifiers, one for each class, while
      One-vs-One trains pairwise classifiers for all possible class combinations, allowing SVM
      to effectively handle multiclass classification by reducing it to a series of binary decisions.
    • RF - is an ensemble learning method that constructs multiple decision trees during
      training. Each tree is built independently and then their predictions are combined to yield
      a more accurate and robust overall prediction. By aggregating the outputs of numerous
      individual trees, RF reduces overfitting and enhances the model’s performance [20].
    • LR - is an ML algorithm specifically designed for binary classification tasks. Similar to
      SVM, multiclass classification in LR is approached through the One-vs-Rest scheme, where
      separate binary classifiers are trained for each class.

Hyperparameters and their values used to train SVM, RF, and LR in the CoLI-Ensemble model
are given in Table 1 and default values are used for rest of the hyperparameters.
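
A minimal sketch of the hard-voting ensemble configured with the hyperparameters of Table 1 is
given below, assuming feature matrices X_train and y_train have already been built from one of
the feature sets in Section 3.1.1.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Base classifiers configured with the hyperparameters of Table 1;
# all remaining hyperparameters keep their scikit-learn defaults.
coli_ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(class_weight="balanced")),
        ("rf", RandomForestClassifier(n_estimators=100, max_depth=None, n_jobs=-1)),
        ("lr", LogisticRegression()),
    ],
    voting="hard",  # majority vote over the predicted language tags
)
# coli_ensemble.fit(X_train, y_train)
# predictions = coli_ensemble.predict(X_dev)
```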

3.2. CoLI-CRF
Given the sequence of observations (words) in a sentence, CRF models the conditional probability
distribution of tags (e.g., POS tags and Named Entity (NE) tags). CRF's strength lies in its
ability to capture dependencies between tags considering both the preceding and succeeding
observations, allowing it to make context-aware predictions in tagging tasks [21]. For large
and structured tag sets, CRF works well with many features that are mutually dependent. In this
work, the CRF classifier is implemented using the sklearn_crfsuite5 library, which acts as a
wrapper around CRFsuite. This library simplifies the classifier construction process by handling
the transformation of textual features into feature vectors and the training of the CRF classifier.
The features used to train the CRF classifier in the proposed CoLI-CRF model are shown in
Table 2, and a sketch of this setup follows the table.




   5
       https://sklearn-crfsuite.readthedocs.io/en/latest/
Table 2
Features used in CoLI-CRF model
                                                 Features
                   A word                                           Previous word-2
                   Length of word                                   Previous word-3
                   Is the word at the beginning of the sentence     Previous word-4
                   Is the word at the end of the sentence           Next word+2
                   Is current word digit                            Next word+3
                   Is current word punctuation                      Next word+4
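
The sketch below, referenced in Section 3.2, reconstructs this setup with sklearn_crfsuite; the
feature names, the toy sentence, and the choice of default L-BFGS training are illustrative
assumptions, with the feature set mirroring Table 2.

```python
import string
import sklearn_crfsuite

def word2features(sent, i):
    """Feature dictionary for the i-th word of a sentence (cf. Table 2)."""
    word = sent[i]
    features = {
        "word": word,
        "word.length": len(word),
        "BOS": i == 0,              # word at the beginning of the sentence
        "EOS": i == len(sent) - 1,  # word at the end of the sentence
        "word.isdigit": word.isdigit(),
        "word.ispunct": all(c in string.punctuation for c in word),
    }
    # Surrounding words at the offsets listed in Table 2.
    for offset in (-4, -3, -2, 2, 3, 4):
        j = i + offset
        if 0 <= j < len(sent):
            features["word%+d" % offset] = sent[j]
    return features

# Toy training data (words from Table 4): tokenized sentences and their tags.
train_sents = [["apundu", "super", "kudla"]]
y_train = [["Tulu", "English", "Location"]]
X_train = [[word2features(s, i) for i in range(len(s))] for s in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs")
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```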

Table 3
Class-wise distribution of CoLI-Tunglish dataset
                                    Category       # of words
                                       Tulu                 8,647
                                     English                5,499
                                    Kannada                 2,068
                                      Name                  1,104
                                      Other                   506
                                      Mixed                   403
                                    Location                  369

Table 4
Sample words and the corresponding labels in CoLI-Tunglish dataset
           Category     Description                                        Samples
           Name         Name of a person                                   shivam, ayyapa
           Location     Indicates the location                             kudla, udupi
           English      Pure English words                                 Sir, super
           Tulu         Tulu words in Roman script                         Apundu, pura
           Kannada      Kannada words in Roman script                      visaya, anna
           Mixed        Combination of Kannada, Tulu, and/or English      vedion, photoga
           Other        Words not belonging to any of the categories      git, mujhe


4. Experiments and Results
The CoLI-Tunglish dataset contains code-mixing of three languages (Tulu, Kannada, and English)
in Roman script for the purpose of word-level LI and consists of seven labels (Tulu, Kannada,
English, Mixed, Name, Location, and Other). The label distribution of the CoLI-Tunglish dataset
is given in Table 3, and sample words from the dataset with their corresponding labels and
descriptions are given in Table 4.
Table 5
Performance of the proposed models

                                                 Development set                   Test set
  Model           Features                  Precision  Recall  F1-score   Precision  Recall  F1-score
  CoLI-Ensemble   Character n-grams           0.86      0.65     0.69       0.86      0.57     0.63
  CoLI-Ensemble   fastText word embeddings    0.85      0.84     0.83       0.87      0.69     0.75
  CoLI-CRF        Text features               0.86      0.84     0.87       0.80      0.74     0.77


   Several experiments were conducted with various feature sets (Tulu BPEmb, OOV embeddings,
and a combination of these embeddings with Tulu fastText embeddings and textual features)
to train a wide range of ML classifiers (SVM, LR, RF, k-Nearest Neighbors (k-NN), MLP,
Decision Tree (DT), and CRF). The models that exhibited better performance on the Development
set were evaluated on the Test set, and the performance of the proposed models on the Development
and Test sets is shown in Table 5. The results indicate that the CoLI-CRF model exhibited a
better macro F1-score than the CoLI-Ensemble model, due to the ability of the CRF model to
capture context. The CoLI-Ensemble model trained with feature vectors extracted from the
fastText pre-trained word embeddings exhibited a slightly lower macro F1-score than the
CoLI-CRF model, which may be attributed to the small vocabulary size of Tulu.


5. Conclusion
This paper describes the models submitted by our team - MUCS - to the "CoLI-Tunglish: Word-level
Language Identification in Code-mixed Tulu Texts" shared task at FIRE 2023, for word-level
LI in code-mixed Tulu texts. Two distinct models are proposed: i) CoLI-Ensemble - a model that
ensembles ML classifiers (SVM, RF, and LR) trained separately with TF-IDF of character n-grams
in the range (1, 3) and fastText word embeddings, and ii) CoLI-CRF - a CRF model trained with
text-based features. Among the proposed models, the CoLI-CRF model achieved a macro F1-score
of 0.77 for word-level LI in code-mixed Tulu text, securing 4th rank in the shared task.


References
 [1] C. M. Scotton, The Possibility of Code-Switching: Motivation for Maintaining Multilin-
     gualism, in: Anthropological linguistics, JSTOR, 1982, pp. 432–444.
 [2] A. Hegde, H. L. Shashirekha, Leveraging Dynamic Meta Embedding for Sentiment Analysis
     and Detection of Homophobic/Transphobic Content in Code-mixed Dravidian Languages,
     2022.
 [3] S. H. Lakshmaiah, F. Balouchzahi, M. D. Anusha, G. Sidorov, CoLI-Machine Learning
     Approaches for Code-Mixed Language Identification at the Word Level in Kannada-English
     Texts, in: Acta Polytechnica Hungarica, 10, 2022, pp. 123–141.
 [4] A. Hande, R. Priyadharshini, B. R. Chakravarthi, KanCMD: Kannada CodeMixed Dataset
     for Sentiment Analysis and Offensive Language Detection, in: Proceedings of the Third
     Workshop on Computational Modeling of People’s Opinions, Personality, and Emotion’s
     in Social Media, 2020, pp. 54–63.
 [5] F. Balouchzahi, H. Shashirekha, LA-SACo: A Study of Learning Approaches for Sentiments
     Analysis in Code-Mixing Texts, in: Proceedings of the First Workshop on Speech and
     Language Technologies for Dravidian Languages, 2021, pp. 109–118.
 [6] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, N. Jose, S. Suryawanshi, E. Sherly, J. P.
     McCrae, DravidianCodeMix: Sentiment Analysis and Offensive Language Identification
     Dataset for Dravidian Languages in Code-Mixed Text, in: Language Resources and
     Evaluation, Springer, 2022, pp. 765–806.
 [7] A. Hegde, S. Lakshmaiah, MUCS@MixMT: IndicTrans-based Machine Translation for
     Hinglish Text, in: Proceedings of the Seventh Conference on Machine Translation (WMT),
     2022, pp. 1131–1135.
 [8] D. K. Sharma, A. Singh, A. Saroha, Language Identification for Hindi Language Transliter-
     ated Text in Roman Script using Generative Adversarial Networks, in: Towards Extensible
     and Adaptable Methods in Computing, Springer, 2018, pp. 267–279.
 [9] K. Ball, D. Garrette, Part-of-speech Tagging for Code-switched, Transliterated Texts
     without Explicit Language Identification, in: Proceedings of the 2018 conference on
     empirical methods in natural language processing, 2018, pp. 3084–3089.
[10] R. Priyadharshini, B. R. Chakravarthi, M. Vegupatti, J. P. McCrae, Named Entity Recog-
     nition for Code-mixed Indian Corpus using Meta Embedding, in: 2020 6th international
     conference on advanced computing and communication systems (ICACCS), IEEE, 2020,
     pp. 68–72.
[11] D. Nguyen, A. S. Doğruöz, Word Level Language Identification in Online Multilingual
     Communication, in: Proceedings of the 2013 conference on empirical methods in natural
     language processing, 2013, pp. 857–862.
[12] P. Veena, M. A. Kumar, K. Soman, An Effective way of Word-Level Language Identification
     for Code-Mixed facebook Comments using Word-Embedding via Character-Embedding,
     in: 2017 International Conference on Advances in Computing, Communications and
     Informatics (ICACCI), IEEE, 2017, pp. 1552–1556.
[13] S. Thara, P. Poornachandran, Transformer based Language Identification for Malayalam-
     English Code-Mixed Text, in: IEEE Access, IEEE, 2021, pp. 118837–118850.
[14] S. Gundapu, R. Mamidi, Word Level Language Identification in English Telugu Code Mixed
     Data, in: arXiv preprint arXiv:2010.04482, 2020.
[15] H. L. Shashirekha, F. Balouchzahi, M. D. Anusha, G. Sidorov, CoLI-Machine Learning
     Approaches for Code-mixed Language Identification at the Word Level in Kannada-English
     Texts, in: arXiv preprint arXiv:2211.09847, 2022.
[16] A. Jamatia, A. Das, B. Gambäck, Deep Learning-based Language Identification in English-
     Hindi-Bengali Code-mixed Social Media Corpora, De Gruyter, 2019, pp. 399–408.
[17] S. Mandal, A. K. Singh, Language Identification in Code-Mixed Data using Multichannel
     Neural Networks and Context Capture, in: arXiv preprint arXiv:1808.07118, 2018.
[18] U. Barman, A. Das, J. Wagner, J. Foster, Code Mixing: A Challenge for Language Iden-
     tification in the Language of Social Media, in: Proceedings of the first workshop on
     computational approaches to code switching, 2014, pp. 13–23.
[19] A. Hegde, H. L. Shashirekha, Urdu Fake News Detection Using Ensemble of Machine
     Learning Models, in: CEUR Workshop Proceedings, 2021, pp. 132–141.
[20] H. Jhamtani, S. K. Bhogi, V. Raychoudhury, Word-Level Language Identification in Bi-
     Lingual Code-Switched Texts, in: Proceedings of the 28th Pacific Asia Conference on
     language, information and computing, 2014, pp. 348–357.
[21] Machine Learning Approaches for Amharic Parts-of-Speech Tagging, in: arXiv preprint
     arXiv:2001.03324, 2020.