UMSNH-INFOTEC@Dravidian-CodeMix-FIRE2020:
An ensemble approach based on multiple text
representations
José Ortiz-Bejara , Jesus Ortiz-Bejara , Jaime Cerda-Jacaboa , Mario Graffb,c and
Eric S. Tellezb,c
a Universidad Michoacana de San Nicolás de Hidalgo, Michoacán, México
b CONACyT Consejo Nacional de Ciencia y Tecnología, Dirección de Cátedras, México
c INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación, México


                                         Abstract
                                         This manuscript describes UMSNH-INFOTEC's participation in the first Sentiment Analysis in Dravidian
                                         Code-Mixed text task at FIRE 2020. Our solution combines several models that solve the task separately;
                                         we then construct a final decision by linearly combining the models' independently computed decision
                                         values, with the combination weights adjusted through differential evolution. The generic text
                                         categorization system 𝜇TC achieves the best performance when a single model is used, and the combined
                                         output improves upon the individual performances.

                                         Keywords
                                         Code-Mixed, Word-embeddings, Sentiment Analysis, 𝜇TC




1. Introduction
One of the most common tasks involving natural language processing is the so-called Sentiment
Analysis (SA). Its main objective is to identify the feelings or intentions behind a given text. The
primary task is identifying the polarity of a text, i.e., whether it is positive, negative, or neutral.
Despite this simple definition, the task becomes challenging due to short contexts, misspellings,
negations, polysemy, and figurative language, among other language characteristics. From a
machine learning perspective, the first step is to use a text model to transform messages into a
vector space; the points in this vector space are then used to train a classification model for
some specific task. Beyond the labels themselves, the text model carries most of the language
and domain knowledge that yields a successful classification.
   Text models range from sparse word frequency-based representations to semantic embeddings
obtained with deep neural networks. While frequency-based models are fitted only on the input
texts of the task at hand, word embeddings can be pre-trained over a large text collection.
Embedding models can also be fitted to a specific dataset; this process

FIRE 2020: Forum for Information Retrieval Evaluation, December 16-20, 2020, Hyderabad, India
email: jortiz@umich.mx (J. Ortiz-Bejar); jesus.ortiz@umich.mx (J. Ortiz-Bejar); jcerda@umich.mx
(J. Cerda-Jacabo); mario.graff@infotec.mx (M. Graff); eric.tellez@infotec.mx (E. S. Tellez)
                                       © 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                 Mixed_feelings      Negative    Positive     not-malayalam     unknown_state
   Training             289            549         2022            647                1344
 Development             44             51         224             60                 161
     Test                70            138          565            177                 398

Table 1
Malayalam-English data description


                    Mixed_feelings     Negative    Positive     not-tamil     unknown_state
      Training            1283            1448        7627         368              609
    Development           141             165         857          29                68
        Test               377             424        2075         100              173

Table 2
Tamil-English data description


requires large amounts of data and time, and therefore embedding models are often used as
fixed text transformers.
   Even though there is a wide variety of pre-trained embeddings for many languages, languages
with few speakers count on limited resources. In this context, frequency-based models exhibit
competitive performance against deep learning-based approaches. For our solution, we include
multiple embeddings from the flair library [1].
   The rest of the paper is organized as follows: first, the Dravidian Code-Mixed task is briefly
described in Section 2. Sections 3 and 4 introduce the basic ideas and models behind our
approach. Section 5 describes the core of our method. Section 6 discusses our results in the
development phase and presents the final ranks, and Section 7 reviews the best 𝜇TC
configurations. Finally, Section 8 presents our conclusions.


2. Task description
The task is a sentiment analysis problem with the particularity that code-mixed texts are part
of the corpus data. In this context, the term code-mixing refers to texts that are not fully written
in their native script. The task consists of two sentiment polarity problems for texts written in
Malayalam-English and Tamil-English. The process of generating both corpora is introduced
in [2, 3, 4, 5, 6].
  Both the Malayalam and Tamil datasets are split into training, development, and test collections.
Messages are labeled with five polarity levels: Mixed_feelings, Negative, Positive, not-malayalam
(respectively not-tamil), and unknown_state. Table 1 shows the label distribution for the
Malayalam dataset; Table 2 describes the same information for the Tamil dataset.
3. A language independent approach
At its core, an SA task may be posed as a classification problem. In this case, a set of texts 𝑇 and a
set of labels Θ are given as input to fit a model 𝑓 capable of predicting the class of new input texts.
The classification problem can be tackled in a language-agnostic way by learning the relationship
between the input 𝑇 and the output Θ, i.e., a model 𝑓 (ℒ (𝑡𝑖 )) = 𝜃𝑖 learned only from the input set,
where each label 𝜃𝑖 ∈ Θ is the class label (output) for 𝑡𝑖 ∈ 𝑇. Such a model 𝑓 works over a transformed
𝑡𝑖 , where the transformation is performed by ℒ. The possible text transformations are vast;
therefore, they should be adapted to the specific task, input data, and classification strategy. The
selection of the best transformation can be performed by hyper-parameter optimization. A
state-of-the-art language-independent approach is 𝜇TC [7]. 𝜇TC optimizes ℒ by exploring
a configuration space comprised of different text transformations, tokenizers, and weighting
schemes. The exploration is led by a meta-heuristic and aims to produce effective configurations
for Support Vector Machine (SVM) classifiers. As our knowledge of the Dravidian languages is
limited, and the input data mix multiple languages, a language-independent approach is well
suited to the proposed solution. Our solution integrates a language-independent model
enriched by pre-trained embedding models.
   For convenience, we assume that the output 𝑓 (ℒ (𝑡𝑖 )) = [𝑝1 , 𝑝2 , … , 𝑝𝑚 ] is a decision function
in vector form of size 𝑚, where 𝑚 is the number of distinct labels in Θ. Each 𝑝𝑘 is then the
score for 𝑡𝑖 belonging to class 𝑘, and the class assigned to 𝑡𝑖 is the 𝑘 for which 𝑝𝑘 is maximum.
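
   To make this concrete, the following minimal sketch shows how an SVM exposes per-class
decision values whose arg max yields the predicted label. It assumes scikit-learn; the toy data
and the TF-IDF transformation stand in for ℒ and are illustrative, not the exact pipeline of our
system.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus standing in for the texts t_i and their labels theta_i.
texts = ["great movie", "bad acting", "just ok", "loved it", "terrible plot"]
labels = ["Positive", "Negative", "unknown_state", "Positive", "Negative"]

L = TfidfVectorizer()             # a frequency-based transformation L
X = L.fit_transform(texts)

f = LinearSVC().fit(X, labels)    # the classifier f
scores = f.decision_function(L.transform(["what a great plot"]))

# scores holds one p_k per class; the predicted class maximizes p_k.
print(f.classes_[np.argmax(scores, axis=1)])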


4. Pre-trained Embeddings
Since their introduction by Mikolov et al. [8], pre-trained word embeddings have often been used
to train models when there is not a large amount of data for a given task. Even though classic
embeddings have a limited capacity for learning rare words, character embeddings [10] and
Byte-Pair embeddings [9] alleviate this situation. Whereas embeddings like Word2Vec [8]
and GloVe [11] operate at the word level, character-based embeddings learn sub-word
representations specific to the task and domain. Byte-Pair embeddings, on the other hand, use
a variable-length encoding that iteratively merges the most frequent symbol pairs into new
symbols. These strategies make both of them suitable alternatives for handling words that are
not part of the input vocabulary, as well as small training corpora.
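
   As an illustration, the sketch below stacks language-specific and multilingual Byte-Pair
embeddings with character embeddings and pools them into one fixed-size vector per message.
It assumes the flair library's embedding classes and BPEmb language codes ("ta", "multi");
identifiers may vary across flair versions.

from flair.data import Sentence
from flair.embeddings import (BytePairEmbeddings, CharacterEmbeddings,
                              DocumentPoolEmbeddings)

# BP-Tamil, BP-Multi, and task-trainable character embeddings, mean-pooled
# into a single document vector usable as input for an SVM.
doc_embedder = DocumentPoolEmbeddings(
    [BytePairEmbeddings("ta"), BytePairEmbeddings("multi"), CharacterEmbeddings()],
    pooling="mean",
)

sentence = Sentence("padam romba nalla irukku")  # a toy code-mixed message
doc_embedder.embed(sentence)
print(sentence.embedding.shape)                  # fixed-size torch tensor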


5. Our solution approach
Our model comprises an optimized language-independent 𝜇TC model together with pre-trained
embeddings specific to Tamil and Malayalam, plus multi-language embeddings. All models are
ensembled through a linear combination of the SVM classifiers' decision functions, one per text
transformation. The weight of each model is adjusted by differential evolution. The model is
represented in Eq. 1:

                         Θ𝑝 = 𝛼1 𝑓 (ℒ1 (𝑇 )) + 𝛼2 𝑓 (ℒ2 (𝑇 )) + ⋯ + 𝛼𝑚 𝑓 (ℒ𝑚 (𝑇 )),                     (1)
                                Model              Precision   Recall   𝐹score
                                 𝜇TC                0.613      0.676    0.631
                           Char/BP-Multi            0.598      0.686    0.593
                       Char/BP-Tamil/BP-Multi       0.607      0.694    0.612
                         BP-Tamil/BP-Multi          0.601      0.692    0.609
                              BP-Multi              0.573      0.681    0.586
                         Linear combination         0.629      0.696    0.657

Table 3
Performance for the different evaluated text models for Tamil-English


where Θ𝑝 holds the predicted values of the linear combination of the multiple SVM decision
functions, one per text transformation, and 𝛼𝑖 is the contribution of the 𝑖-th model to the final
decision function.
   The previous formulation allows us to state our solution as a constrained optimization
problem, where the 𝛼 coefficients must be optimized to maximize a given fitness function. Here,
the weighted 𝐹score is maximized; Eq. 2 defines the optimization problem.

                                     maximize 𝐹score (Θ, Θ𝑝 )
                                         𝛼𝑖
                                                                                           (2)
                                     subject to 0 ≤ 𝛼𝑖 ≤ 1.
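
   This optimization can be run off the shelf. The sketch below searches for the 𝛼 coefficients
of Eq. 1 that maximize the weighted 𝐹score on the development set; it assumes scipy's
differential_evolution and scikit-learn's f1_score, and uses random placeholder decision values
instead of the real model outputs.

import numpy as np
from scipy.optimize import differential_evolution
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
m, n, c = 5, 200, 5                     # models, dev messages, classes
decisions = rng.normal(size=(m, n, c))  # placeholders for the f(L_i(T)) outputs
y_true = rng.integers(0, c, size=n)     # placeholder development labels

def neg_weighted_f1(alpha):
    combined = np.tensordot(alpha, decisions, axes=1)     # Eq. 1, shape (n, c)
    y_pred = combined.argmax(axis=1)
    return -f1_score(y_true, y_pred, average="weighted")  # minimize -F_score

result = differential_evolution(neg_weighted_f1, bounds=[(0, 1)] * m, seed=0)
print(result.x, -result.fun)            # the alpha_i in [0, 1] and the best F_score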

6. Experiments and results
We carry out experiments on five models and combinations built from the optimized 𝜇TC,
multi-language Byte-Pair embeddings (BP-Multi), language-specific Byte-Pair embeddings, and
character embeddings. The model selected for submission was the one exhibiting the best scores
for the evaluated metrics (weighted versions of 𝐹score , Precision, and Recall). Table 3 shows the
performances on the Tamil-English development dataset. As can be seen, the best single-model
performance is achieved by 𝜇TC, while the linear combination improves it by around two
percentage points in our metrics.
   On the other hand, Table 4 shows the performance of each single text model and the
proposed linear combination on the Malayalam-English development dataset. Again, the best
model performance is achieved by 𝜇TC, but for Malayalam the increase achieved by the linear
combination is smaller than for the Tamil-English task.
   Table 5 shows the 𝛼 values for each vector representation when the ensemble model is
optimized, see Eq. 2. In both cases, the model with the highest contribution is 𝜇TC; its
contribution is greater for Malayalam. There, the second-highest weight corresponds to the
stacked model using character and multi-language Byte-Pair embeddings. For Tamil, in contrast,
Char/BP-Multi is the model with the lowest contribution, while stacking Tamil and
multi-language Byte-Pair embeddings contributes the second-highest 𝛼 value.
                                Model                 Precision     Recall   𝐹score
                                 𝜇TC                    0.670       0.677    0.667
                            Char/BP-Multi               0.647       0.657    0.649
                     Char/BP-Malayalam/BP-Multi         0.631       0.637    0.631
                       BP-Malayalam/BP-Multi            0.629       0.637    0.630
                               BP-Multi                 0.638       0.648    0.640
                          Linear combination            0.674       0.683    0.668

Table 4
Performance for the different evaluated text models for Malayalam-English


                                   Model             Malayalam       Tamil
                                   𝜇TC                 0.9472       0.7017
                              Char/BP-Multi            0.7071       0.0110
                          Char/BP-Lang/BP-Multi        0.1289       0.1373
                            BP-Lang/BP-Multi           0.3115       0.4892
                                BP-Multi               0.1075       0.3142

Table 5
Model’s weighting in our linear-combination ensembling schema, i.e., 𝛼𝑖 values in Eq. 2. Each column
lists the weighting vector for a single language.


7. A brief review of the best 𝜇TC models' parameters
For the sake of completeness, Table 6 shows the parameters of the best models obtained by the
𝜇TC system. We briefly summarize the involved parameters, described in detail in [12]. The
parameters in Table 6 may be roughly divided into preprocessing and weighting schemes.
  Parameters with the handler suffix (i.e., emoji, hashtag, number, url, and user) have three
possible options: delete, group, and identity. The delete option removes all occurrences of entities
of the specified kind, while the group option replaces each occurrence with a common identifier
for the kind. For instance, setting emoji-handler to delete removes all emojis in the text, whereas
the group option replaces each emoji occurrence with the special token _emo; this operation is
designed for tasks that take advantage of the syntactic information of the token regardless of
its precise value. The identity option leaves the instances untouched. On the other hand, binary
parameters like diacritic-removal, duplication-removal, and punctuation-removal indicate whether
the corresponding symbols are removed. The lower-case parameter establishes whether the text
is lower-cased or kept as-is.
  Furthermore, 𝜇TC allows the use of three different tokenizers:

    • Word 𝑛-grams. This scheme tokenizes the text into words and then produces all possible
      sub-sequences of 𝑛 words (i.e., 𝑚 − 𝑛 + 1 tokens for a text with 𝑚 words). For this parameter,
      Malayalam's best-performing model includes tokens of length 2, 3, 5, and 9. On the
      other hand, the Tamil model uses tokens of 1 and 3 words.
    • Character 𝑞-grams. This approach produces 𝑞-grams at the character level, i.e., each token
      is a sub-string of 𝑞 characters. Here, the best model for Tamil has tokens of length 2 and
                          Parameter name          Dravidian language
                                                 Malayalam      Tamil
                          lower-case                  True       False
                          emoji-handler           identity       group
                          hashtag-handler         identity      delete
                          url-handler                group       group
                          user-handler               group       group
                          number-handler            delete    identity
                          diacritic-removal           True        True
                          duplication-removal        False        True
                          punctuation-removal         True        True
                          𝑞-grams                  3, 2, 1        3, 2
                          𝑛-grams               2, 3, 5, 9        1, 3
                          skip-grams                  none      (2, 1)
                          weighting scheme           tfidf          tf
                          token-max-filter               1           1
                          token-min-filter              −1          −1

Table 6
𝜇TC best configuration parameters for Tamil and Malayalam texts


      3. In contrast, the Malayalam model includes sequences of 1, 2, and 3 characters.
    • Skip-grams are word 𝑛-grams that skip middle words within word sub-sequences; for this
      tokenizer, one must specify the sub-sequence length and the number of middle words to
      skip, as sketched below. This class of tokens is not used in the best Malayalam configuration;
      the best Tamil model removes the middle word from continuous word sequences of length
      three, obtaining (2, 1) skip-grams.
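
   To make the (2, 1) notation concrete, here is a small self-contained sketch of skip-gram
tokenization; the helper function and the separator character are illustrative, not 𝜇TC's actual
implementation.

def skip_grams(words, size=2, skip=1):
    # Keep `size` words per token, dropping `skip` middle words between them.
    span = size + (size - 1) * skip
    return ["~".join(words[i:i + span:skip + 1])
            for i in range(len(words) - span + 1)]

print(skip_grams("this movie is really great".split()))
# -> ['this~is', 'movie~really', 'is~great']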

7.1. Weighting
Two frequency-based weighting schemes are used over the vector space of the bag-of-words
representations obtained with the preprocessing and tokenization strategies mentioned above.
Either term frequency (TF) or term frequency-inverse document frequency (TFIDF) is selected,
along with frequency thresholds given by token-min-filter 𝑐𝑚𝑖𝑛 and token-max-filter 𝑐𝑚𝑎𝑥 : all
tokens with frequency below 𝑐𝑚𝑖𝑛 or above 𝑐𝑚𝑎𝑥 · max-freq are deleted, where max-freq stands
for the frequency of the most repeated token in the collection. Table 6 indicates that no token
filtering is applied to the vocabularies.
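
   For reference, a configuration similar to the Tamil column of Table 6 could be instantiated
roughly as follows. This is a sketch assuming microtc's TextModel API, where negative token_list
entries denote word 𝑛-grams, positive entries character 𝑞-grams, and tuples skip-grams; the
parameter names and option values are assumptions and may differ across microtc versions.

from microtc.textmodel import TextModel

# Approximate Tamil configuration from Table 6 (illustrative only).
tamil_model = TextModel(
    lc=False,                 # lower-case: False
    emo_option="group",       # emojis replaced by a common token
    hashtag_option="delete",  # hashtags removed
    url_option="group",
    usr_option="group",
    num_option="none",        # identity: numbers left untouched
    del_diac=True,            # diacritic-removal
    del_dup=True,             # duplication-removal
    del_punc=True,            # punctuation-removal
    token_list=[3, 2, -1, -3, (2, 1)],  # q-grams 3, 2; n-grams 1, 3; skip-gram (2, 1)
    weighting="tf",
)

texts = ["super padam", "waste movie"]
tamil_model.fit(texts)
X = [tamil_model[t] for t in texts]  # sparse vectors ready for an SVM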

7.2. Final Rank-list
We submitted the linear combination based on its performance on the development datasets.
Table 7 shows the rank of our approach for each task. In both cases, our method achieved
above-average performance.
                              Task name          Precision    Recall   𝐹score   Rank
                            Tamil-English              0.61    0.68    0.63      3
                          Malayalam-English            0.68    0.69    0.68      6

Table 7
UMSNH-INFOTEC's final rank (determined by performance on the test dataset)


8. Conclusions
This manuscript presents a solution for the Dravidian code-mixed text task that integrates
language-specific knowledge from pre-trained models with a language-independent model. Our
approach is based on the linear combination of several independently created models, with
weights tuned by differential evolution. Our proposal can be implemented with open-source
libraries in relatively few lines of code, and achieves competitive performance on the evaluated
tasks. The scripts and data used to implement our approach are available at GitHub1 .


References
 [1] A. Akbik, D. Blythe, R. Vollgraf, Contextual string embeddings for sequence labeling,
     in: COLING 2018, 27th International Conference on Computational Linguistics, 2018, pp.
     1638–1649.
 [2] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for
     sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint
     Workshop on Spoken Language Technologies for Under-resourced languages (SLTU)
     and Collaboration and Computing for Under-Resourced Languages (CCURL), European
     Language Resources association, Marseille, France, 2020, pp. 202–210. URL: https://www.
     aclweb.org/anthology/2020.sltu-1.28.
 [3] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis
     dataset for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on
     Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration
     and Computing for Under-Resourced Languages (CCURL), European Language Resources
     association, Marseille, France, 2020, pp. 177–184. URL: https://www.aclweb.org/anthology/
     2020.sltu-1.25.
 [4] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly,
     J. P. McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in
     Code-Mixed Text, in: Proceedings of the 12th Forum for Information Retrieval Evaluation,
     FIRE ’20, 2020.
 [5] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly,
     J. P. McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in
     Code-Mixed Text, in: Working Notes of the Forum for Information Retrieval Evaluation
     (FIRE 2020), CEUR Workshop Proceedings, CEUR-WS.org, Hyderabad, India, 2020.

   1
       https://github.com/kyriox/dravidian-codemixed
 [6] B. R. Chakravarthi, Leveraging orthographic information to improve machine translation
     of under-resourced languages, Ph.D. thesis, NUI Galway, 2020.
 [7] E. S. Tellez, D. Moctezuma, S. Miranda-Jiménez, M. Graff, An automated text categorization
     framework based on hyperparameter optimization, Knowledge-Based Systems 149 (2018)
     110–123. URL: https://github.com/INGEOTEC/microtc.
 [8] T. Mikolov, K. Chen, G. Corrado, J. Dean, I. Sutskever, G. Zweig, word2vec, 2013. URL:
     https://code.google.com/p/word2vec.
 [9] B. Heinzerling, M. Strube, Bpemb: Tokenization-free pre-trained subword embeddings in
     275 languages, arXiv preprint arXiv:1710.02187 (2017).
[10] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures
     for named entity recognition, arXiv preprint arXiv:1603.01360 (2016).
[11] J. Pennington, R. Socher, C. Manning, Glove: Global Vectors for Word Representation, in:
     Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
     (EMNLP), 2014. doi:10.3115/v1/D14-1162.
[12] E. Tellez, D. Moctezuma, S. Miranda-Jiménez, M. Graff, An automated text categorization
     framework based on hyperparameter optimization, Knowledge-Based Systems (2018).
     doi:10.1016/j.knosys.2018.03.003.