Cross-Lingual Propagation of Sentiment Information Based on Bilingual Vector Space Alignment

Flavio Giobergia, Luca Cagliero, Paolo Garza, Elena Baralis
Politecnico di Torino, Turin, Italy
{flavio.giobergia, luca.cagliero, paolo.garza, elena.baralis}@polito.it

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Deep learning methods have proven to be particularly effective in inferring the sentiment polarity of a text snippet. However, in cross-domain and cross-lingual scenarios there is often a lack of training data. To tackle this issue, propagation algorithms can be used to yield sentiment information for various languages and domains by transferring knowledge from a source language (usually English). To propagate polarity scores to the target language, these algorithms take as input an initial vocabulary and a bilingual lexicon. In this paper we propose to enrich lexicon information for cross-lingual propagation by inferring the bilingual semantic relationships from an aligned bilingual vector space. This allows us to exploit the underlying text similarities that are not made explicit by the lexicon. The experiments show that our approach outperforms the state-of-the-art propagation method on multilingual datasets.

1 INTRODUCTION
In the last decade, an increasing amount of opinionated data has been recorded in digital form (e.g., reviews, tweets, blogs). This has fostered the joint use of Natural Language Processing (NLP) and Machine Learning (ML) techniques to extract people's opinions, sentiments, emotions, and attitudes from text, i.e., the sentiment analysis (or opinion mining) problem [16].

Recently proposed approaches (e.g., [1, 8, 9]) aim to predict the sentiment polarity of the analyzed text by means of deep learning techniques.
However, Deep Neural Networks (DNNs) require a sufficiently large corpus of labeled data in order to train accurate sentiment predictors [11]. Meeting such a requirement can be challenging when coping with multilingual and cross-domain data. In particular, the majority of annotated text is written in English, whereas only small amounts of data are available for less commonly spoken languages. Furthermore, the sentiment of a text snippet strongly depends on its surrounding context: for example, a word may have different connotations in different domains. Hence, tailoring DNN models to the right domain and language is crucial for developing accurate and portable sentiment analyzers.

A promising strategy to overcome the lack of multilingual training data has recently been proposed by [9]. They propose an approach to propagate sentiment information, encoded into high-dimensional embedding vectors [17], across languages. The idea is to consider an initial vocabulary for which sentiment embeddings are known (usually in English) and a lexicon that maps English words to those in the target language. The mapping indicates the semantic relationship between pairs of words. First, the word-level sentiment polarity in various domains is extracted in the source language using a supervised transfer learning process. Then, the vector scores for the target language are induced using stochastic gradient descent. A more detailed description of [9] is given in Section 3.

Challenge. The quality of the sentiment score propagation strongly depends on the richness of the bilingual lexicon. When a bilingual lexicon is either not available or partly incomplete, the induction phase is unable to effectively propagate the sentiment polarity scores from the original language to the target one.

Research goals. The goal of this work is to improve the quality of the sentiment propagation phase across languages. The key idea is to enrich lexicon information in the propagation phase by deriving the semantic links among word pairs from an aligned bilingual vector space. This allows us to exploit the underlying text similarities that are not made explicit in the bilingual lexicon. We use an established model for the vector representation of words, i.e., fastText [4]. A deep learning approach to generate aligned fastText word vectors has recently been proposed [13]. Once trained, the bilingual vector spaces not only embed lexicon information but also allow us to derive non-trivial semantic text relationships directly from the latent space. This simplifies the procedure of cross-lingual induction and exploits the vector representation of text in the latent space to infer missing word relationships. The authors of [13] have also published pre-trained aligned vectors for a large number of languages. Hence, a promptly usable, general-purpose vector representation of text is readily available.

Approach. To propagate the multi-domain sentiment polarity scores of a word in the original language (e.g., English), we explore the bilingual aligned vector space. Specifically, an arbitrary word in the original vocabulary is described by two high-dimensional vectors: a latent vector in the original embedding space and a sentiment vector describing the sentiment polarity of the word in different domains. Thanks to the bilingual model, we project each word from the original word embedding to the vector space of the target language and look for its nearest neighbors. Neighbors are likely to be semantically related to the original word (regardless of the presence of an explicit link in the bilingual lexicon). The semantic links and the similarity scores between the projected word and its neighbors are used to drive the sentiment propagation phase towards the target language.

Achievements. The proposed approach produces sentiment embeddings that outperform the state-of-the-art embeddings by [9] on various multilingual benchmark datasets (e.g., using the embeddings with an SVM classifier, it achieves a 10% average macro-F1 score improvement on the Italian datasets).

Paper outline. Section 2 overviews the related literature. Section 3 summarizes the cross-lingual propagation method presented by [9] and introduces the mathematical notation used throughout the paper. Section 4 presents the proposed approach. Section 5 reports the outcomes of the empirical evaluation. Finally, Section 6 draws conclusions and presents future works.
2 RELATED WORK
To predict the sentiment polarity of textual reviews, news, and posts, several deep learning-based sentiment analysis approaches have been proposed. Most of them (e.g., [2, 12, 18]) are language- or domain-specific, i.e., they are specifically tailored to a given context (e.g., movie reviews, Twitter posts) and language. Hence, model learning assumes that a large enough training set is available. Unfortunately, in many real contexts and for various languages this is not the case.

To extend the applicability of existing sentiment analysis solutions towards other languages, the use of automated machine translation tools has been investigated [3, 7, 10, 20]. The main drawbacks of automatic text translation tools are that the process is computationally intensive, the generated translations are prone to errors, and the tools often miss word semantic differences that depend on the context of use [10]. Parallel strategies entail (i) building sentiment lexicons tailored to different languages and domains of interest and exploiting them to train supervised models [5], and (ii) integrating syntax-based rules into unsupervised models [19]. However, all the aforementioned approaches require a significant human effort, which has been undertaken only for the major languages and the most popular domains.

To propagate sentiment information across different languages and domains, a deep learning approach has recently been presented by Dong and De Melo [9]. They consider an initial vocabulary of English words for which sentiment embeddings are known, and a translation lexicon representing semantic relationships between pairs of words (both non-English and English) such as translations, synonyms, orthographic variants, and other semantic, morphological, and etymological word relationships. In [9], links between words are extracted from a multilingual Wiktionary dump [6]. However, for the languages and domains not yet supported by the Etymological Wordnet project, the method remains unable to propagate sentiment information. Furthermore, some relevant semantic links could be missing in the input lexicon. Our proposal is to rely on an aligned bilingual vector space, e.g., [13], from which both explicit and implicit semantic relationships among words can be inferred.

3 SENTIMENT PROPAGATION BASED ON TRANSLATION LEXICON
To propagate sentiment information to various languages, the approach proposed by [9] generates a sentiment embedding vector v_x for each word x in a multilingual vocabulary V. A sentiment embedding is a high-dimensional vector reflecting the distribution of the word's sentiment polarities across a large range of domains (the vectors used in the experiments have 26 dimensions, one for each domain plus an extra dimension combining all domains together). If the same word is used in multiple languages, each instance is treated as a distinct word in V.

The sentiment vector associated with each word x in the initial vocabulary V_0 is derived via transfer learning. Specifically, for each domain d_j a linear Support Vector Machine classifier [15] M_j(x) = w_j · x + b is trained on a set of domain-specific textual documents. The classifier assigns a polarity to each word, denoting whether the word is peculiar to the domain under analysis. The j-th component of the embedding vector v_x incorporates the coefficient w_j of the linear model M_j. Hereafter, we will assume sentiment information to be known for an initial vocabulary V_0 ⊂ V, which usually consists of a subset of English words (i.e., the most popular language used in electronic documents).
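To make the transfer-learning step concrete, the following minimal sketch (not the authors' code; all variable and helper names are hypothetical) trains one linear SVM per domain on a bag-of-words representation and reads each word's domain polarity off the corresponding model coefficient:

    # Illustrative sketch of the per-domain transfer-learning step.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    def domain_polarity(docs, labels):
        """Train M_j on one domain; return a {word: coefficient} map."""
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(docs)        # documents x words
        clf = LinearSVC().fit(X, labels)          # labels in {-1, +1}
        return dict(zip(vectorizer.get_feature_names_out(),
                        clf.coef_.ravel()))

    def sentiment_vectors(domains, vocabulary):
        """domains: list of (docs, labels) pairs, one per domain d_j.
        The j-th component of v_x is the coefficient of x in M_j
        (0 if x does not occur in that domain's corpus)."""
        per_domain = [domain_polarity(d, l) for d, l in domains]
        return {x: np.array([pd.get(x, 0.0) for pd in per_domain])
                for x in vocabulary}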
The translation lexicon TL is a set of triples {(x_1, x'_1, we_1), ..., (x_m, x'_m, we_m)}, with x_1, x'_1, ..., x_m, x'_m ∈ V, providing evidence of the semantic relationships holding between pairs of words in the multilingual vocabulary. The lexicon maps words in the original language to their corresponding translations. Notice that each word may have multiple translations. To incorporate relationships such as synonyms, orthographic variants, and etymological connections, the lexicon also includes links between pairs of words of the same language. In [9] the translation lexicon is extracted from a multilingual Wiktionary dump [6]; the links between pairs of words represent translations, synonyms, morphology, derivation, and etymological links. The weight associated with a triple (x_q, x'_q, we_q), 1 ≤ q ≤ m, denotes the relevance of the semantic relationship. In [6] it indicates the number of semantic links occurring in the data source.

The sentiment vectors of all the words in V are populated by propagating the cross-domain polarity scores of the words in V_0 via an iterative optimization approach, i.e., Stochastic Gradient Descent (SGD). The optimization problem addressed by the SGD entails assigning values to the sentiment vector v_x of every word x in the multilingual vocabulary V according to the following objective function:

$$C \cdot \sum_{x \in V_0} \lVert v_x - \tilde{v}_x \rVert^2 \;-\; \sum_{x \in V} \left[ \frac{1}{\sum_{(x, x', we) \in TL} we} \sum_{(x, x', we) \in TL} we \; v_x^\top v_{x'} \right]$$

where, given a word x ∈ V_0, ṽ_x represents its initial sentiment vector (learned through transfer learning).

The first term of the loss function ensures that the sentiment vectors of the words in the initial vocabulary V_0 do not diverge significantly from the original ones, for a large enough constant C. The second term guarantees that the inferred sentiment vectors of words that are linked together (to some extent) in the translation lexicon are kept similar. A drawback of this loss function is that the dot product in the second term allows arbitrarily large magnitudes for the inferred sentiment vectors: the dot product can grow by indefinitely increasing the magnitude of the vectors being learned. This issue is addressed by the proposed approach.
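The magnitude issue can be checked numerically. In the toy example below (arbitrary illustrative vectors), scaling a linked pair of vectors by a factor s makes the dot-product term of the loss s² times more negative, so the objective can be driven down indefinitely without learning anything about sentiment:

    import numpy as np

    v_x, v_y = np.array([1.0, -2.0]), np.array([0.5, -1.0])
    for s in (1, 10, 100):
        print(s, -(s * v_x) @ (s * v_y))   # second term of the loss in [9]
    # 1 -2.5
    # 10 -250.0
    # 100 -25000.0  -> the loss decreases without bound as the scale s grows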
4 PROPOSED APPROACH
Instead of propagating sentiment information directly by means of the translation lexicon, we aim to link semantically related words indirectly, according to their similarity in a bilingual word embedding space. The idea is to improve the quality of the cross-lingual propagation phase by projecting polarity scores extracted from a richer text representation based on latent spaces.

4.1 Bilingual embedding space
Each word in a dictionary is mapped to a vector in the latent space. The application of word embeddings to many natural language processing tasks is well established. A pioneering work in this field is the Word2Vec model [17]. fastText [4] is a well-known extension of Word2Vec that provides a more effective vector representation by incorporating sub-words into the input dictionary. The vectors associated with the sub-words can be conveniently combined to generate the embeddings of new words that are not present in the dictionary.

Vector representations of text are generated, separately for each language, using deep learning architectures. However, pre-trained vector models (learned from the Wikipedia corpus) are also available for a large number of languages (https://fasttext.cc/). To link words in different languages, the per-language models first need to be aligned. The procedure to align bilingual fastText vector spaces is thoroughly described in [13]. Notably, a large number of pre-trained aligned models is available (https://fasttext.cc/docs/en/aligned-vectors.html). This allows users to exploit general-purpose, multilingual vectors (with 300 dimensions, trained on Wikipedia for 44 languages) without the need to retrain them from scratch.
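As a usage sketch (assuming the gensim library is available; the file names are those published on the fastText aligned-vectors page), the aligned spaces can be queried directly across languages, since the published files are plain word2vec-format text:

    from gensim.models import KeyedVectors

    # Aligned vectors from https://fasttext.cc/docs/en/aligned-vectors.html
    en = KeyedVectors.load_word2vec_format("wiki.en.align.vec")
    it = KeyedVectors.load_word2vec_format("wiki.it.align.vec")

    # Because the spaces are aligned, an English vector can be compared
    # directly against Italian vectors to retrieve its K nearest neighbors.
    neighbors = it.similar_by_vector(en["excellent"], topn=5)
    print(neighbors)   # e.g., eccellente, ottimo, ... (cf. Table 1)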
4.2 Sentiment propagation strategy
Let E_O be the fastText embedding space in the original language (e.g., English) and let E_T be the aligned embedding space in the target language (i.e., a language other than English). Each word x in the original language has a corresponding vector v_x^{e_o} in E_O. Thanks to the aligned bilingual vector space, we can project v_x^{e_o} into the target vector space to obtain the corresponding target vector v_x^{e_t} in E_T. Notice that the new vector does not necessarily correspond to any real word in the target language.

We exploit word similarities in the latent space to propagate sentiment information. Specifically, let v_x^s be the sentiment vector of an arbitrary word x ∈ V_0. We aim to propagate sentiment information to the other words in V \ V_0. This issue is addressed in two steps: (i) first, we create a word graph representing the most significant pairwise word similarities; (ii) next, we propagate the sentiment scores to the other words using gradient descent. Unlike [9], we adopt a new loss function tailored to the problem under analysis. Notice that step (i) allows us to obtain a richer word representation compared to a bilingual lexicon.

Word Graph Creation. The word graph G = (V, E) is an undirected weighted graph connecting pairs of words in V. Edges in E are triples (x, x', w_xx'), where x, x' ∈ V are the connected vertices and w_xx' is the edge weight. For each word x ∈ V we explore the neighborhood of vector v_x^{e_t} in the target latent space to look for the neighbor words that are most semantically related to x. More specifically, we look for the K nearest vectors (where K is a user-specified parameter) corresponding to the words of the target language that are closest to v_x^{e_t}, and select these words. Given the set NN_x of x's nearest neighbors, we create a weighted edge e ∈ E connecting every x' ∈ NN_x to x. The weight of the edge connecting words x and x' indicates the pairwise word similarity in the latent space and is computed using the cosine similarity [15]. To avoid introducing unreliable word relationships and to limit graph connectedness, we filter out the edges (links) with weight below a given (user-specified) threshold α. The effect of the parameters K and α on the performance and complexity of the proposed approach is discussed in Section 5; a minimal implementation sketch of this construction step is given after the example below.

Example. Suppose that the original language is English and the target language is Italian. Let us consider the following input parameters: K = 5 and α = 0.4. Table 1 and Figure 1 report an example related to the English word excellent. Specifically, Table 1 reports the five nearest neighbors of excellent, while Figure 1 shows the word sub-graph associated with that word. Only the first two neighbors of excellent have a cosine similarity greater than or equal to 0.4. Hence, only the Italian words eccellente and ottimo are connected to excellent in the word graph G. This is semantically correct, because eccellente is the Italian translation of excellent and ottimo has a similar meaning. The three discarded neighbors are other "positive" adjectives, but they do not have the same meaning as excellent (their translations are appreciable, good, and suitable, respectively). Hence, the enforcement of the minimum similarity threshold helps us remove noisy connections.

Table 1: Example: K nearest neighbors of excellent. Original language = English, target language = Italian, K = 5

    x          x's nearest neighbors   Cosine similarity
    excellent  eccellente              0.575
               ottimo                  0.513
               apprezzabile            0.369
               buon                    0.367
               adatto                  0.322

[Figure 1: Example: word sub-graph associated with excellent. Original language = English, target language = Italian, K = 5, α = 0.4. Only the edges excellent-eccellente (0.575) and excellent-ottimo (0.513) survive the α threshold.]
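A minimal sketch of the graph-construction step (hypothetical variable names; `source_vecs` and `target_vecs` are assumed to be the aligned embedding matrices), using scikit-learn's nearest-neighbor search with the cosine metric:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def build_word_graph(source_words, source_vecs,
                         target_words, target_vecs, K=5, alpha=0.4):
        """Return the edge set E as (x, x', w_xx') triples."""
        nn = NearestNeighbors(n_neighbors=K, metric="cosine").fit(target_vecs)
        dist, idx = nn.kneighbors(source_vecs)  # cosine distance = 1 - similarity
        edges = []
        for i, x in enumerate(source_words):
            for d, j in zip(dist[i], idx[i]):
                similarity = 1.0 - d
                if similarity >= alpha:          # prune unreliable links
                    edges.append((x, target_words[j], similarity))
        return edges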
The English word excellent, which is one of the words in V_0, is characterized by a sentiment embedding (i.e., a vector of cross-domain polarity scores). The sentiment vectors of the Italian words are populated by propagating the cross-domain polarity scores of the English words via the iterative optimization approach described in the following.

Gradient Descent with Updated Loss Function. Gradient descent is used to propagate sentiment information through the word graph. As discussed in Section 3, the iterative propagation process should both preserve the values of the vectors in the initial vocabulary V_0 and guarantee a high degree of similarity between the sentiment vectors of linked words. To achieve these goals, we adopt the following objective function:

$$C \cdot \sum_{x \in V_0} \lVert v_x - \tilde{v}_x \rVert^2 \;+\; \sum_{x \in V} \left\lVert v_x - \frac{1}{\sum_{(x, x', w_{xx'}) \in E} w_{xx'}} \sum_{(x, x', w_{xx'}) \in E} w_{xx'} \, v_{x'} \right\rVert_2^2$$

where, for each word x in the initial vocabulary V_0, the first term minimizes the deviation from its initial sentiment embedding vector ṽ_x. The second term minimizes the deviation from the sentiment vectors of its neighbors, represented as connected words in the word graph. Adopting the L2-norm in the second term allows the propagation of the vector dimensions without altering the vector magnitude. Therefore, words in the initial vocabulary keep, to a good approximation, their original vectors, whereas new words get sentiment polarity scores similar to those of their neighbors in the target latent space.
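The propagation can be sketched as a full-batch gradient-descent variant of the optimization (a simplification of the iterative SGD; the names and the row-normalized adjacency representation are our assumptions, not the authors' implementation). With W the row-normalized weighted adjacency matrix of G, the second term of the loss is $\lVert V - WV \rVert_F^2$, whose gradient is $2(I - W)^\top (I - W)V$:

    import numpy as np

    def propagate(V, V0_mask, V0_init, W, C=10.0, lr=0.01, n_iter=500):
        """Gradient descent on C * sum_{x in V0} ||v_x - v~_x||^2
                               +     sum_{x in V} ||v_x - (W V)_x||^2.
        V:       (n_words, n_domains) sentiment matrix (unknown rows start at 0).
        V0_mask: boolean mask of the initial-vocabulary rows.
        V0_init: the fixed transfer-learning vectors for those rows.
        W:       row-normalized adjacency matrix (dense or scipy.sparse)."""
        for _ in range(n_iter):
            resid = V - W @ V                    # deviation from neighbor average
            grad = 2.0 * (resid - W.T @ resid)   # gradient of ||(I - W) V||^2
            grad[V0_mask] += 2.0 * C * (V[V0_mask] - V0_init)
            V = V - lr * grad
        return V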
5 EXPERIMENTS
The experiments presented in this section are aimed at evaluating the quality of the sentiment vectors resulting from the application of the proposed methodology. The evaluation is formulated as a binary sentiment analysis problem. The sentiment embeddings are compared, in terms of macro-F1 score, with those produced by the method presented in [9].

All the experiments were run on a machine equipped with an Intel® Xeon® X5650 CPU and 32 GB of RAM, running Ubuntu 18.04.1 LTS.

The rest of this section is organized as follows. Subsection 5.1 describes the settings used in the experimental validation as well as the analyzed datasets. Subsection 5.2 summarizes the main results. Subsection 5.3 discusses the influence of the main parameters on the performance of the proposed approach. Finally, Subsection 5.4 analyzes the spatial complexity of the proposed approach.

5.1 Experimental setting
To validate the quality of the generated sentiment vectors, we set up a binary sentiment classification task over multilingual datasets. Specifically, given a set of short text snippets labelled as positive or negative according to their sentiment polarity, we aim at predicting the sentiment polarities of a subset of related snippets for which the polarities are assumed to be unknown.

To accomplish the classification task, we train two popular classification models, i.e., the Support Vector Machine (SVM) and the Random Forest (RF) classifiers [15]. Classifiers are first trained separately on each multilingual training dataset and then applied to the corresponding test set. More specifically, each dataset is split into a training set (80% of the data), used for training the models and tuning the hyper-parameters, and a test set (20%), used for performance evaluation. Classifier settings are chosen according to the outcomes of a grid search based on 5-fold cross-validation. Separately for each language and dataset, we evaluate the performance of each classification model in terms of macro-F1 score. The F1 score is a popular metric that indicates the harmonic mean of the precision and recall of the generated model [15]. Unlike the traditional (micro-averaged) F1 score, the macro-F1 score averages the per-class F1 scores, giving each class equal weight. Hence, the metric is deemed more suitable for evaluating imbalanced datasets, i.e., datasets for which the class labels are unevenly distributed in the training data.

Classifiers are trained on a vector representation of the input text snippets. The vector associated with each snippet is computed by averaging, dimension by dimension, the sentiment vectors of the words included in the snippet (a sketch of the evaluation pipeline is given at the end of this subsection). Notice that the aforesaid task could alternatively be addressed using Recurrent Neural Networks or Convolutional Neural Networks [14]; the comparison between different deep learning techniques is out of the scope of the present work.

The classification performance achieved on the sentiment vectors produced by our method is compared with that achieved on the vectors produced by [9]. The comparison is aimed at showing the higher effectiveness of the proposed approach compared to state-of-the-art solutions.

Data. The list of datasets used for the experiments comprises all the datasets adopted by [9] plus two larger external datasets with different data distributions. Table 2 summarizes the characteristics of the existing datasets.

The datasets provided by [9] were extracted from several websites and collect the reviews left by users on specific topics (e.g., places, movies, food). The target binary class (positive or negative) is derived from the user rating. However, user ratings are not necessarily binary values (they usually comply with the 5-star system). To generate the binary sentiment polarities, we have discretized the 5-star ratings as follows: reviews with 3 stars are discarded (as they are considered neutral), 1- and 2-star reviews are assigned to the negative class, and 4- and 5-star reviews to the positive one.

Table 2: Cardinality and class distribution for each of the datasets presented in [9]

    Dataset  Cardinality  #Positive  #Negative
    cs       2,458        1,660      798
    de       2,407        1,839      568
    es       2,951        2,367      584
    fr       3,912        2,080      1,832
    it       3,559        2,867      692
    nl       1,892        1,232      660
    ru       3,414        2,500      914

The statistics reported in Table 2 clearly show a strong class imbalance in the analyzed data. This may hinder the training of robust classifiers, as the minority class may not be sufficiently represented by the trained models. To evaluate the performance of the proposed approach on more balanced data as well, we have considered two additional datasets for the Italian language (i.e., the language for which the imbalance ratio of the corresponding dataset is maximal). Table 3 describes the two new Italian datasets. The data was extracted from reviews left by TripAdvisor (https://www.tripadvisor.com) users in different Italian cities.

Table 3: Key statistics for the new Italian datasets

    Dataset  Cardinality  #Positive  #Negative
    IT1      10,024       5,012      5,012
    IT2      13,888       6,942      6,946
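The evaluation pipeline described above can be sketched as follows (hypothetical names; `sent_vecs` maps each word to its 26-dimensional sentiment vector; the grid-search tuning step is omitted for brevity):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import f1_score

    def snippet_vector(snippet, sent_vecs, dim=26):
        """Average the sentiment vectors of the words in a snippet."""
        vecs = [sent_vecs[w] for w in snippet.lower().split() if w in sent_vecs]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def evaluate(snippets, labels, sent_vecs):
        X = np.vstack([snippet_vector(s, sent_vecs) for s in snippets])
        X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2)
        clf = SVC().fit(X_tr, y_tr)   # RandomForestClassifier used analogously
        return f1_score(y_te, clf.predict(X_te), average="macro")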
5.2 Performance comparison
Table 4 summarizes the results obtained on the various datasets. For each dataset, the performance of SVM and RF is reported both for the proposed methodology (denoted as Our method) and for the sentiment embeddings produced by [9] (denoted as Dong and De Melo). The outcomes of the proposed methodology were achieved by setting K to 5 and α to 0.4; Subsection 5.3 discusses the effect of the input parameters on the performance of the proposed method.

Table 4: Comparison, in terms of macro-F1 score, between the embeddings produced by the proposed methodology (Our method) and those generated by [9] (Dong and De Melo)

               Our method        Dong and De Melo
    Dataset    SVM      RF       SVM      RF
    cs         0.7403   0.7198   0.7227   0.7297
    de         0.6847   0.6981   0.6495   0.6756
    es         0.6131   0.531    0.4451   0.4892
    fr         0.7021   0.7291   0.6389   0.6764
    it         0.8256   0.794    0.6805   0.6644
    nl         0.6869   0.6369   0.5903   0.6022
    ru         0.6840   0.6112   0.7221   0.7009
    IT1        0.8439   0.8424   0.7435   0.7311
    IT2        0.8441   0.8427   0.7415   0.7494

The proposed methodology for cross-lingual sentiment propagation performs better than the method proposed by [9] in terms of macro-F1 score on the majority of the analyzed datasets (the Russian reviews being the only exception).

To gain insight into the classifiers' performance, we also explore the ability of the classifiers to correctly assign each class label (i.e., precision) as well as to retrieve the largest extent of the test samples labeled with each class (i.e., recall). Table 5 reports the macro-precision and macro-recall values (indicating the means of the per-class precision and recall values, respectively) [15]. Based on the achieved results, we can conclude that classifier performance is not biased towards either of the aforesaid metrics. Interestingly, the embeddings produced by [9] show higher precision for multiple languages, but their recall is often worse than that achieved by the proposed method (which relies on the unified latent space model).

Table 5: Results in terms of macro-precision and macro-recall, for embeddings generated by the proposed methodology (Our method) and those introduced in [9] (Dong and De Melo)

                          Our method        Dong and De Melo
    Dataset  Metric       SVM      RF       SVM      RF
    cs       Precision    0.7347   0.7326   0.7203   0.7547
             Recall       0.7593   0.712    0.7474   0.7177
    de       Precision    0.6797   0.7481   0.6563   0.7735
             Recall       0.7372   0.6766   0.7131   0.6507
    es       Precision    0.6111   0.7747   0.4010   0.6181
             Recall       0.6154   0.5428   0.5      0.5172
    fr       Precision    0.7025   0.7301   0.6488   0.6784
             Recall       0.7019   0.7309   0.6403   0.6760
    it       Precision    0.8494   0.8168   0.6750   0.8030
             Recall       0.8071   0.7765   0.7637   0.6336
    nl       Precision    0.6868   0.6651   0.6059   0.6491
             Recall       0.704    0.6317   0.6162   0.6022
    ru       Precision    0.6805   0.6623   0.7151   0.7362
             Recall       0.7221   0.6025   0.7634   0.6845
    IT1      Precision    0.8441   0.8425   0.7442   0.7314
             Recall       0.8439   0.8424   0.7436   0.7312
    IT2      Precision    0.8442   0.8428   0.7416   0.7495
             Recall       0.8441   0.8427   0.7415   0.7495

5.3 Parameter analysis
We also study the effect of different values of the parameters K and α on the quality of the generated embeddings. To do so, we separately analyze their impact on the macro-F1 scores achieved by the binary classifiers. Hereafter, for the sake of brevity, we report only the results achieved on a representative dataset (IT2), which is the largest and most balanced among all the tested ones. Similar results were achieved on the other datasets.

The parameter K indicates the number of neighbors considered while linking the words in the original language to those in the target language. The higher K, the more word relationships are included in the word graph. As a drawback, when K is relatively high, the model may include less relevant or unreliable links. Furthermore, since the connectedness of the graph increases, the complexity of the sentiment propagation process gets worse (see Subsection 5.4). Figure 2 shows how the macro-F1 score varies as K increases, for α = 0.4. The plot highlights a knee in the curve at K = 5: for the purpose of sentiment classification, using a larger value of K does not yield significant performance improvements. Notice that, to remove the less reliable links, the word graph is pruned early by enforcing the cut-off threshold α; the impact of the pruning phase is higher when high K values are set.

[Figure 2: Macro-F1 score as a function of K (for K = 1 to 21), on dataset IT2. Both the SVM and RF curves flatten beyond K = 5, with macro-F1 scores in the 0.79-0.85 range.]

We also separately analyze the impact of the parameter α. Enforcing low α values potentially introduces a bias in the graph due to the presence of "noisy" links, whereas setting high α values limits the word graph connectedness. Given an edge (x, x', w_xx') ∈ E, the edge weight w_xx' is computed as the cosine similarity between x and x' [15]. The cosine similarity takes (absolute) values between 0 (orthogonal vectors) and 1 (parallel vectors); hence, α has the same value range. Figure 3 shows how the macro-F1 score varies as α increases. The lower bound set for α (approximately 0.4) is the best we can manage with the hardware resources currently in use, since setting lower values requires more computational memory; Subsection 5.4 provides a more detailed analysis of the space complexity of the problem. Specifically, the empirical results show that, by setting α to 0.4, the sentiment propagation process converges to a satisfactory solution with the hardware resources available for this study. Setting lower α values (i.e., limited graph pruning) yields sentiment embeddings of higher quality. Conversely, when high α values are set (specifically, for values of α ≥ 0.7), the pruning phase is not beneficial.

[Figure 3: Macro-F1 score as a function of α (for α between 0.4 and 0.9), on dataset IT2, for both SVM and RF.]
5.4 Complexity analysis
The most computationally intensive step of the proposed method is the sentiment propagation on the word graph based on Stochastic Gradient Descent. It entails computing the gradient of the loss function described in Section 4 and then iteratively updating the sentiment polarities until a local minimum is reached.

To exploit the hardware optimizations available for matrix computations, the gradient can be computed on the entire matrix rather than separately for each weight. Specifically, to process the information embedded in the word graph, an adjacency matrix A is defined. Each matrix value A_ij indicates the weight of the edge linking two arbitrary words x_i and x_j; if the edge does not exist, the corresponding matrix value is zero. Since graph connectedness is bounded by the cut-off threshold α, the adjacency matrix is rather sparse (the higher α, the sparser A). To compute the gradient of the adopted loss function, the adjacency matrix of the word graph is loaded into main memory. Large sparse matrices can be efficiently loaded into main memory by storing only the non-zero elements. However, this limits the types of operations that can be performed; hence, in most cases, a denser in-memory representation is needed.

Let N be the size of the initial vocabulary V_0 in the original language (English, in our case). For each word in V_0, K neighbor words are selected from the target language, so in the worst case the word graph contains (K + 1)N words. Since part of the neighbors in the target language overlap, we can assume, to a good approximation, that each word in the original language has one translation in the target language, yielding 2N words in the resulting graph. The corresponding adjacency matrix consists of 4N² cells. Assuming B bytes are used to represent each floating point number (where B = 4 or B = 8 in modern systems), the total adjacency matrix size is 4BN². Since the size N of the initial English vocabulary ranges between 80,000 and 100,000, the required memory allocation ranges between 100 and 300 GB (see the worked check at the end of this subsection).

A possible way to optimize the process is to identify connected sub-graphs and to run the Stochastic Gradient Descent separately on each sub-graph. This is feasible because nodes and edges external to a connected sub-graph do not influence sentiment propagation within it. This reduces the size of the processed adjacency matrices, which are stored in main memory. However, when graphs are highly connected (as in our case), the optimization is not very beneficial. Therefore, as discussed in Subsection 5.3, in the experimental evaluation reported in this study we have decided to limit the computational complexity of the propagation process by properly setting the K and α parameters.
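The memory figures above follow directly from the 4BN² estimate (a worked check, under the paper's own assumptions of 2N words and dense B-byte floats; the 100-300 GB range is matched approximately):

    # Dense adjacency over ~2N words: (2N)^2 = 4N^2 floats of B bytes each.
    for N in (80_000, 100_000):
        for B in (4, 8):
            print(f"N={N:,}, B={B}: {4 * B * N**2 / 1e9:.0f} GB")
    # N=80,000,  B=4: 102 GB    N=80,000,  B=8: 205 GB
    # N=100,000, B=4: 160 GB    N=100,000, B=8: 320 GB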
6 CONCLUSIONS AND FUTURE WORKS
This paper presents an in-progress research study on the use of a bilingual latent space to propagate sentiment information across multiple languages. The proposed approach overcomes the limitations of previously proposed solutions, which stem from the dependence of the propagation phase on the bilingual lexicon. Our claim is that relying on latent word relationships (which embed lexicon information as well) enhances the process of sentiment propagation in cross-lingual and multi-domain contexts.

We have empirically compared the sentiment embeddings generated by the proposed methodology with those produced by the approach presented in [9]. Specifically, the embeddings have been exploited to tackle a binary sentiment analysis problem. The results confirm the initial claim: for most of the considered languages, the propagated information yields better results.

The presented study leaves room for several extensions. Firstly (and most importantly), we aim at extending the deep learning process (based on a dual-channel CNN) presented by [9] by embedding the enhanced sentiment vector propagation phase. This would allow us to fully explore the potential of the new methodology within a state-of-the-art deep neural network architecture for sentiment analysis.

A further exploration will be devoted to identifying the optimal setting of the α parameter. We plan not only to increase the computational power, but also to study more sophisticated strategies to optimize the propagation phase, as well as to design greedy strategies able to overcome the limitations due to the iterative optimization process.

Finally, we plan to test further multilingual datasets. Since most of the publicly available datasets are small- or medium-sized and quite imbalanced, we aim at crawling, releasing, and testing new data related to various domains and written in different languages.

7 ACKNOWLEDGEMENTS
This work has been partially supported by the SmartData@PoliTO center on Big Data and Data Science.

REFERENCES
[1] Oscar Araque, Ignacio Corcuera-Platas, J. Fernando Sánchez-Rada, and Carlos A. Iglesias. 2017. Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Systems with Applications 77 (2017), 236–246. https://doi.org/10.1016/j.eswa.2017.02.002
[2] Oscar Araque, Ignacio Corcuera-Platas, J. Fernando Sánchez-Rada, and Carlos A. Iglesias. 2017. Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Systems with Applications 77 (2017), 236–246. https://doi.org/10.1016/j.eswa.2017.02.002
[3] Carmen Banea, Rada Mihalcea, Janyce Wiebe, and Samer Hassan. 2008. Multilingual subjectivity analysis using machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Honolulu, Hawaii, USA, 127–135.
[4] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[5] Yanqing Chen and Steven Skiena. 2014. Building sentiment lexicons for all major languages. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Baltimore, Maryland, USA, 383–389.
[6] Gerard de Melo. 2014. Etymological Wordnet: Tracing The History of Words. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). European Language Resources Association (ELRA), Reykjavik, Iceland, 1148–1154. http://www.lrec-conf.org/proceedings/lrec2014/pdf/1083_Paper.pdf
[7] Erkin Demirtas and Mykola Pechenizkiy. 2013. Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining. ACM, Chicago, USA, 9.
[8] Hai Ha Do, PWC Prasad, Angelika Maag, and Abeer Alsadoon. 2019. Deep Learning for Aspect-Based Sentiment Analysis: A Comparative Review. Expert Systems with Applications 118 (2019), 272–299. https://doi.org/10.1016/j.eswa.2018.10.003
[9] Xin Dong and Gerard de Melo. 2018. Cross-Lingual Propagation for Deep Sentiment Analysis. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). AAAI Press, New Orleans, Louisiana, USA, 5771–5778.
[10] Kevin Duh, Akinori Fujino, and Masaaki Nagata. 2011. Is machine translation ripe for cross-lingual sentiment classification?. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2. Association for Computational Linguistics, Portland, Oregon, USA, 429–433.
[11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org
[12] Hitkul Jangid, Shivangi Singhal, Rajiv Ratn Shah, and Roger Zimmermann. 2018. Aspect-Based Financial Sentiment Analysis Using Deep Learning. In Companion Proceedings of The Web Conference 2018 (WWW '18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 1961–1966. https://doi.org/10.1145/3184558.3191827
[13] Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave. 2018. Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
[14] Ji Young Lee and Franck Dernoncourt. 2016. Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, 515–520. https://doi.org/10.18653/v1/N16-1062
[15] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets (2nd ed.). Cambridge University Press, New York, NY, USA.
[16] Bing Liu. 2015. Sentiment Analysis - Mining Opinions, Sentiments, and Emotions. Cambridge University Press.
[17] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic Regularities in Continuous Space Word Representations. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics (NAACL-HLT 2013). Atlanta, Georgia, USA, 746–751. https://www.aclweb.org/anthology/N13-1090/
[18] Aliaksei Severyn and Alessandro Moschitti. 2015. Twitter Sentiment Analysis with Deep Convolutional Neural Networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15). Association for Computing Machinery, New York, NY, USA, 959–962. https://doi.org/10.1145/2766462.2767830
[19] David Vilares, Carlos Gómez-Rodríguez, and Miguel A. Alonso. 2017. Universal, unsupervised (rule-based), uncovered sentiment analysis. Knowledge-Based Systems 118 (2017), 45–55. https://doi.org/10.1016/j.knosys.2016.11.014
[20] Xiaojun Wan. 2009. Co-Training for Cross-Lingual Sentiment Classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Association for Computational Linguistics, Suntec, Singapore, 235–243. https://www.aclweb.org/anthology/P09-1027