Multi-task Learning for Cross-Lingual Sentiment Analysis

Gaurish Thakkar[0000−0002−8119−5078], Nives Mikelic Preradović[0000−0001−9087−0074], and Marko Tadić[0000−0001−6325−820X]

Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb 10000, Croatia
gthakkar@m.ffzg.hr, nmikelic@m.ffzg.hr, marko.tadic@ffzg.hr

Abstract. This paper presents a cross-lingual sentiment analysis of news articles using zero-shot and few-shot learning. The study aims to classify Croatian news articles with positive, negative, and neutral sentiment labels using a Slovene dataset. The system is based on a trilingual BERT-based model trained on three languages: English, Slovene, and Croatian. The paper analyses different setups for using datasets in the two languages and proposes a simple multi-task model for sentiment classification. The evaluation covers few-shot and zero-shot scenarios in single-task and multi-task experiments for Croatian and Slovene.

Keywords: sentiment analysis · cross-lingual · transfer learning · multi-task learning · news sentiment · under-resourced languages.

1 Introduction

Sentiment analysis is one of the most exciting applications of Natural Language Processing. The field encompasses text analysis ranging from customer reviews of online products and movies to user-generated text from social media platforms. Other applications include the analysis of financial news [2,7] for stock market movement. While the field spans multiple subareas, here we are interested in sentiment analysis of news articles. This paper focuses on improving sentiment analysis of Croatian news articles tagged with coarse-grained sentiment labels. Since Slovene and Croatian belong to the same (sub)family of Slavic languages, and since only a rather small dataset is available for Croatian, Slovene was chosen as the hub language.

We see the following items as the main contributions of our work: a) the mutual dependence of the sentiment analysis task at three levels (i.e., document level, paragraph level, and sentence level) is leveraged to improve the performance of cross-lingual sentiment analysis; b) language encoder representations are shared across both languages; and c) a language from the same language family is used for cross-lingual sentiment transfer.¹

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ Source code: https://github.com/cleopatra-itn/SentimentAnalyserSLHRNews

2 Related Work

Earlier state-of-the-art approaches to sentiment analysis relied on sentiment lexicons [17] as well as other classical methods such as TF-IDF [11] and hand-crafted features combined with SVMs [13]. As automatic feature extraction became the common trend with the rise of deep learning [15,18], Convolutional Neural Networks [21] took the lead and were then gradually replaced by various Recurrent Neural Network approaches [9,16,23,12]. Machine translation, one of the most well-studied cross-lingual techniques, also aids sentiment labelling; its applications range from lexicon translation [1,3] to instance translation [20]. Several recent works have explored transformers in a multi-task learning setup for monolingual [5] and multilingual [6,22] sentiment classification.
The closest work upon which our research builds [14] performs zero-shot sentiment learning on Croatian news articles by enriching a masked language model (mBERT) [8] with sentiment tags from the SentiNews and Croatian news sentiment datasets. Our work is novel in the way it uses the dataset: previous work does not utilise the paragraph-level and sentence-level annotations for document-level classification but only uses them to pre-train a masked language model, whereas we utilise these annotations in a multi-task setup to aid the sentiment classification task for Croatian.

3 Datasets

We use two datasets in our experiments.

SentiNews Dataset in Slovene. This is a manually annotated dataset [4]² in the news domain. It contains 10,427 documents, annotated at three levels of granularity: document, paragraph, and sentence level. The dataset covers news from economics, finance, and politics published between 1 September 2007 and 31 December 2013. It contains each annotator's annotation of every instance, along with the news content and the final sentiment label. The dataset was annotated on a five-point Likert scale and mapped onto three labels (Positive, Negative, and Neutral) using the scale's average score. The overall distribution of this dataset is given in Table 1.

Table 1. Slovene and Croatian dataset statistics.

Language  Level        Examples  Positive  Negative  Neutral
Slovene   Documents      10,427     1,665     3,337    5,425
          Paragraphs     89,999    14,636    23,721   51,642
          Sentences     165,071    27,091    44,629   93,351
Croatian  Documents       2,025       325       456    1,244

² https://www.clarin.si/repository/xmlui/handle/11356/1110

Sentiment Dataset in Croatian. The Croatian dataset³ was created using guidelines similar to those of the SentiNews dataset. The text content comes from the Croatian daily news portal 24sata and covers topics such as health, lifestyle, and automotive news. Table 1 shows the statistics for this dataset. Like the Slovene corpus, this dataset is annotated with 3-class sentiment labels and covers the same domain; however, it does not contain paragraph-level and sentence-level annotations. Both datasets exhibit an imbalanced class distribution.

³ https://www.clarin.si/repository/xmlui/handle/11356/1342

4 Methodology

In our proposed system, we strive to leverage the contextual information available at three distinct levels in the Slovene dataset, together with the Croatian dataset, to promote knowledge transfer between the two languages. Since Croatian and Slovene showed the highest level of mutual intelligibility among the three South Slavic languages (Croatian, Slovene, and Bulgarian) represented in the mutual intelligibility study [10], we hypothesise that Slovene is the right candidate as a hub language for cross-lingual knowledge transfer to Croatian. In addition, both datasets belong to the same text type: news.

A simple multi-task learning setup with three different task heads is employed. Each task head is a classification layer responsible for classifying instances of a particular type; the instances are of three types, namely document level, paragraph level, and sentence level. The document-classification head is trained on the concatenation of the Croatian and Slovene datasets. The paragraph-classification and sentence-classification heads are fed only with Slovene instances, since the Croatian dataset does not provide these annotations. A shared encoder is used for feature extraction, enabling feature sharing across all three tasks and both languages.
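To make this setup concrete, the following is a minimal sketch of a shared-encoder model with three task-specific heads, written with PyTorch and Hugging Face Transformers. This is an illustration, not the authors' released implementation (see the repository linked in Section 1 for that); the checkpoint identifier and the task-routing convention are assumptions.

```python
# Minimal sketch of the shared-encoder multi-task model (illustrative only).
import torch.nn as nn
from transformers import AutoModel

class MultiTaskSentimentModel(nn.Module):
    def __init__(self, encoder_name="EMBEDDIA/crosloengual-bert",
                 num_labels=3, dropout=0.3):
        super().__init__()
        # Shared trilingual encoder used by all three tasks and both languages.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.dropout = nn.Dropout(dropout)
        # One classification head per annotation granularity.
        self.heads = nn.ModuleDict({
            "document": nn.Linear(hidden, num_labels),
            "paragraph": nn.Linear(hidden, num_labels),
            "sentence": nn.Linear(hidden, num_labels),
        })

    def forward(self, input_ids, attention_mask, task):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Use the [CLS] token representation as the sequence feature.
        cls = self.dropout(out.last_hidden_state[:, 0])
        return self.heads[task](cls)  # logits for the requested task head
```

Each batch is routed to exactly one head via the task argument, while gradients from every task flow through the shared encoder, which is what enables feature sharing across tasks and languages.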
Our model is trained in five different scenarios in total; the overall architecture is presented in Fig. 1. In the zero-shot learning settings, we do not use any Croatian data in training but report performance on the Croatian test set.

1. Single-task-SL-Zero-shot-HR - Train only on Slovene document-level data. In this setting, the Slovene data is used in a zero-shot setting for Croatian sentiment analysis.
2. Multi-task-SL-Zero-shot-HR - Train on all three levels of the Slovene data. This setting is the same as the previous one, except that all three classification heads learn their respective tasks from Slovene data. No Croatian data is used in this or the previous setting.
3. Single-task-HR - Train only on Croatian document-level data. This is a classic fine-tuning setup involving a single head learning classification on a single dataset.
4. Multi-task-HR+SL - Train with Croatian (document-level) and Slovene data (all three levels). The document-level instances from both languages are concatenated, while the other heads are trained on the compatible instances from the Slovene dataset.
5. Single-task-HR+SL - Train with Slovene and Croatian document-level data. This approach is similar to the previous one, but only the single document-level classification head is trained.

Fig. 1. The overall architecture of the proposed system: a shared encoder and three task-specific classification heads.

5 Experimental Setup

This section presents a brief description of the pre-processing steps, followed by the details of the experiments.

5.1 Preprocessing

A few pre-processing steps were performed on the dataset (a minimal sketch of these steps is given below):

1. Empty strings, which carried neutral tags, were dropped. As null strings have no content, we decided to drop these instances.
2. Strings were de-duplicated based on their content. Many strings in the dataset were duplicates, and this step prevents leakage of instances into the validation or test split.

Table 2 depicts the overall distribution of the dataset after preprocessing. There is a drop of roughly 3,800 instances in the sentence-level Slovene dataset.

Table 2. Dataset statistics after preprocessing.

Language  Level        Examples  Positive  Negative  Neutral
Slovene   Documents      10,417     1,665     3,337    5,418
          Paragraphs     86,803    14,270    23,265   49,268
          Sentences     161,291    26,679    44,014   90,598
Croatian  Documents       1,988       321       450    1,217

5.2 Language Model - Shared Encoder

Our pipeline uses a shared encoder based on a BERT-style masked language model: the trilingual model CroSloEngual BERT [19], trained on 5.9 billion tokens of Slovene, Croatian, and English. This model was chosen because it covers both languages between which we want to transfer knowledge, and it outperforms mBERT [8] on Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and dependency parsing for these three languages.

5.3 Experiments

Each dataset defined in Table 2 is split 80:20 into train and test sets in a stratified fashion. Since the datasets differ in size, a proportionate 10% of each train set is kept aside as a development set. All datasets are combined into a single collection, which behaves as our size-proportional population: the model is trained sequentially, one task per batch, by randomly sampling from this collection, so each task is drawn in proportion to its size (see the sampling sketch below).
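The preprocessing and splitting steps of Section 5.1 and above can be sketched as follows with pandas and scikit-learn. The column names content and sentiment are assumptions for illustration, not the datasets' actual schema.

```python
# Sketch of the preprocessing (Section 5.1) and splitting steps; column
# names "content" and "sentiment" are assumed, not the real dataset schema.
import pandas as pd
from sklearn.model_selection import train_test_split

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Drop null/empty content (such instances carried neutral tags).
    df = df.dropna(subset=["content"])
    df = df[df["content"].str.strip() != ""]
    # 2. De-duplicate on content so no string leaks across splits.
    return df.drop_duplicates(subset="content")

def split(df: pd.DataFrame, seed: int = 42):
    # Stratified 80:20 train/test split, then 10% of train held out as dev.
    train, test = train_test_split(df, test_size=0.2,
                                   stratify=df["sentiment"], random_state=seed)
    train, dev = train_test_split(train, test_size=0.1,
                                  stratify=train["sentiment"], random_state=seed)
    return train, dev, test
```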
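The size-proportional task sampling described above could look like the sketch below: pooling the batches of all tasks and visiting them in random order means each task is sampled in proportion to its dataset size. The loader interface is an assumption; materialising all batches in memory is for illustration only.

```python
# Sketch of size-proportional task sampling (Section 5.3). "loaders" maps a
# task name ("document"/"paragraph"/"sentence") to an iterable of batches.
import random

def train_one_epoch(model, optimizer, loss_fn, loaders):
    # Pool all (task, batch) pairs; shuffling the pool draws each task in
    # proportion to its size. A streaming sampler would be used in practice.
    pool = [(task, batch) for task, loader in loaders.items() for batch in loader]
    random.shuffle(pool)
    model.train()
    for task, (input_ids, attention_mask, labels) in pool:
        optimizer.zero_grad()
        logits = model(input_ids, attention_mask, task=task)
        loss = loss_fn(logits, labels)  # e.g. nn.CrossEntropyLoss()
        loss.backward()
        optimizer.step()
```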
We evaluated our proposed approach on the Croatian and Slovene test sets. Table 3 shows the model configuration. All our models were trained on an Nvidia RTX 3090 (24 GB) with a batch size of 32. All hyperparameters were kept constant throughout the experiments except the number of epochs, which was 5 for the single-task and 3 for the multi-task setup; these values were chosen to prevent overfitting on the train set. The overall training time for the MTL setup was 3 hours. We evaluated performance on the development set and chose the best model for reporting test performance. Table 4 reports all the results of our experiments.

Table 3. Model configuration.

Parameter   Value
Optimizer   Adam (lr = 2e-5)
Loss        Categorical cross-entropy
Output      Softmax
Batch size  32
Epochs      5 (STL), 3 (MTL)
Dropout     0.3
Labels      3

5.4 Results

The single-task (STL) and multi-task (MTL) results are reported in Table 4. Precision, recall, and F1 are macro-averaged for all experiments (a sketch of this computation is given below, after the discussion). We used a simple majority-class classifier as our baseline. Our MTL setup with the Croatian and Slovene datasets outperforms the other settings for Croatian sentiment classification. However, it does not perform best on document-level classification for Slovene; there is a small drop in performance, similar to what was previously reported by [14]. The second-best performing model for Croatian is the single-task variant with Slovene and Croatian data. The worst-performing models on Croatian were the MTL and the STL trained only on Slovene data, i.e., the zero-shot settings. Nevertheless, the SL MTL performs better on the Slovene test sets. The HR STL, which used the least amount of data of all settings, performed on par with the SL STL in F1 (56.95 vs 55.61), but with contrasting precision and recall. At the paragraph level, SL MTL performs similarly to SL+HR MTL, with a slight drop in performance in the latter case; a similar observation holds for sentence-level classification.

Table 4. Results of the experiments. Since the datasets are imbalanced, we report macro-averaged precision, recall, and F1.

                Slovene                                                   Croatian
                Document            Paragraph           Sentence          Document
Train set       P      R      F1    P      R      F1    P      R      F1    P      R      F1
Majority class  17.33  33.33  22.80 18.91  33.33  24.13 18.72  33.33  23.97 20.43  33.33  25.33
SL STL          70.56  71.65  71.07 -      -      -     -      -      -     57.88  63.86  55.61
SL MTL          73.15  77.66  74.86 70.58  73.50  71.83 68.77  70.13  69.40 56.48  62.83  50.07
HR STL          56.84  44.92  43.86 -      -      -     -      -      -     61.41  56.34  56.95
SL+HR STL       68.99  72.14  70.16 -      -      -     -      -      -     62.80  64.66  63.53
SL+HR MTL       72.32  78.00  74.21 70.36  74.03  71.84 68.32  70.34  69.21 63.01  67.54  63.86

6 Discussion

The research presented in this paper shows that having even a small amount of target-language (Croatian) data helps overall cross-lingual transfer learning. Adding data from another language hinders performance on the source-language task but improves performance on the target-language task. Compared with [14], which proposes joint optimisation of a language modelling task and paragraph-level sentiment analysis for document-level sentiment classification, we did not perform any intermediate training but utilised the available annotations directly at the fine-tuning stage. Another significant difference is our use of a trilingual shared encoder, which performs better than mBERT.
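For reference, the macro-averaged scores in Table 4 correspond to a computation like the following scikit-learn sketch; the paper does not show the authors' evaluation code, so this is an assumption about the metric only, not their implementation.

```python
# Macro-averaged precision/recall/F1 as reported in Table 4, plus the
# majority-class baseline from its first row.
from collections import Counter
from sklearn.metrics import precision_recall_fscore_support

def macro_scores(y_true, y_pred):
    # Macro averaging weighs all three classes equally despite the imbalance.
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return p, r, f1

def majority_baseline(y_train, y_test):
    # Predict the most frequent training label for every test instance.
    majority = Counter(y_train).most_common(1)[0][0]
    return macro_scores(y_test, [majority] * len(y_test))
```

Note that a majority-class predictor always attains a macro recall of 1/(number of classes), which explains the constant 33.33 recall in the baseline row of Table 4.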
We qualitatively analysed the MTL model's errors on the test set and found that the model does pick up cues from sentiment-bearing words in the input; a full analysis of these errors is, however, beyond the scope of this paper. The current model tags a news document as positive or negative when it finds positive or negative words in neutral news: articles that are advertisements or recipes contain words with positive sentiment, and most of the errors belong to this class. In addition, since our encoder has a fixed-length input, truncated text prevents correct classification. This problem could be addressed by sliding-window sampling over the text or by using the beginning and the end of each article.

7 Conclusion

We presented an overall setup for cross-lingual sentiment analysis (SA) of Croatian news documents using a multi-task learning approach. The goal was to perform knowledge transfer using existing datasets and models to aid SA in Croatian. For this purpose, we used the large Slovene SentiNews corpus, which was created similarly to the Croatian corpus.

A publicly available trilingual BERT-based language model was used as a shared encoder providing feature representations for the downstream tasks. We combined the Croatian and Slovene datasets for document-level classification and trained three different classification heads. The results show that the MTL setup outperformed the STL setup for Croatian.

In the future, we would like to experiment more with an under-resourced setting for Slovene and to balance the dataset across the distinct levels. Another interesting approach would be to use each level of the datasets in a hierarchical fashion to help with the next level's classification task; for example, sentence-level features could aid paragraph classification and then be fused into the document-level prediction process. We believe that processing documents from different text types and topics separately would make the task easier; therefore, in future work we plan to cluster documents into text types and topics such as recipes, advertisements, or obituaries and process them separately. Our experimental setting could also be verified by running it on the other language pairs supported by the CroSloEngual BERT model, English-Slovene and English-Croatian; in this way, we could check whether a genetically and typologically distant language, as English is here, contributes to the performance.

8 Acknowledgements

The work presented in this paper has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 812997, under the name CLEOPATRA (Cross-lingual Event-centric Open Analytics Research Academy).

References

1. Abdalla, M., Hirst, G.: Cross-lingual sentiment analysis without (good) translation. In: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 506–515. Asian Federation of Natural Language Processing, Taipei, Taiwan (Nov 2017), https://www.aclweb.org/anthology/I17-1051
2. Agić, Ž., Ljubešić, N., Tadić, M.: Towards sentiment analysis of financial texts in Croatian. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10) (2010)
3. Balahur, A., Steinberger, R., Kabadjov, M., Zavarella, V., van der Goot, E., Halkia, M., Pouliquen, B., Belyaeva, J.: Sentiment analysis in the news.
   In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10) (2010)
4. Bučar, J., Žnidaršič, M., Povh, J.: Annotated news corpora and a lexicon for sentiment analysis in Slovene. Language Resources and Evaluation 52(3), 895–919 (2018)
5. Cer, D., Yang, Y., Kong, S.y., Hua, N., Limtiaco, N., St. John, R., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., Strope, B., Kurzweil, R.: Universal sentence encoder for English. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 169–174. Association for Computational Linguistics, Brussels, Belgium (Nov 2018). https://doi.org/10.18653/v1/D18-2029, https://www.aclweb.org/anthology/D18-2029
6. Chidambaram, M., Yang, Y., Cer, D., Yuan, S., Sung, Y.H., Strope, B., Kurzweil, R.: Learning cross-lingual sentence representations via a multi-task dual-encoder model. arXiv preprint arXiv:1810.12836 (2018)
7. Day, M.Y., Lee, C.C.: Deep learning for financial sentiment analysis on finance news providers. In: 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). pp. 1127–1134. IEEE (2016)
8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
9. Dong, L., Wei, F., Tan, C., Tang, D., Zhou, M., Xu, K.: Adaptive recursive neural network for target-dependent Twitter sentiment classification. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 49–54 (2014)
10. Golubović, J., Gooskens, C.: Mutual intelligibility between West and South Slavic languages. Russian Linguistics 39(3), 351–373 (2015)
11. Lin, K.H.Y., Yang, C., Chen, H.H.: Emotion classification of online news articles from the reader's perspective. In: 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. vol. 1, pp. 220–226. IEEE (2008)
12. Majumder, N., Poria, S., Hazarika, D., Mihalcea, R., Gelbukh, A., Cambria, E.: DialogueRNN: An attentive RNN for emotion detection in conversations. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 6818–6825 (2019)
13. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002). pp. 79–86. Association for Computational Linguistics (Jul 2002). https://doi.org/10.3115/1118693.1118704, https://www.aclweb.org/anthology/W02-1011
14. Pelicon, A., Pranjić, M., Miljković, D., Škrlj, B., Pollak, S.: Zero-shot learning for cross-lingual news sentiment classification. Applied Sciences 10(17), 5993 (2020)
15. Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. pp. 1631–1642 (2013)
16. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215 (2014)
17. Taboada, M., Brooke, J., Tofiloski, M., Voll, K., Stede, M.: Lexicon-based methods for sentiment analysis. Computational Linguistics 37(2), 267–307 (2011)
18. Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured long short-term memory networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 1556–1566 (2015)
19. Ulčar, M., Robnik-Šikonja, M.: FinEst BERT and CroSloEngual BERT. In: Sojka, P., Kopeček, I., Pala, K., Horák, A. (eds.) Text, Speech, and Dialogue. pp. 104–111. Springer International Publishing, Cham (2020)
20. Wan, X.: Co-training for cross-lingual sentiment classification. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. pp. 235–243 (2009)
21. Wang, W., Feng, S., Gao, W., Wang, D., Zhang, Y.: Personalized microblog sentiment classification via adversarial cross-lingual multi-task learning. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 338–348. Association for Computational Linguistics, Brussels, Belgium (Oct-Nov 2018). https://doi.org/10.18653/v1/D18-1031, https://www.aclweb.org/anthology/D18-1031
22. Yang, Y., Cer, D., Ahmad, A., Guo, M., Law, J., Constant, N., Ábrego, G.H., Yuan, S., Tar, C., Sung, Y., Strope, B., Kurzweil, R.: Multilingual universal sentence encoder for semantic retrieval. CoRR abs/1907.04307 (2019), http://arxiv.org/abs/1907.04307
23. Zadeh, A., Liang, P.P., Poria, S., Vij, P., Cambria, E., Morency, L.P.: Multi-attention recurrent network for human communication comprehension. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32 (2018)