Cross-Lingual Propagation of Sentiment Information Based on Bilingual Vector Space Alignment

Flavio Giobergia, Luca Cagliero, Paolo Garza, Elena Baralis
Politecnico di Torino, Turin, Italy
{flavio.giobergia, luca.cagliero, paolo.garza, elena.baralis}@polito.it

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Deep learning methods have proven to be particularly effective in inferring the sentiment polarity of a text snippet. However, in cross-domain and cross-lingual scenarios there is often a lack of training data. To tackle this issue, propagation algorithms can be used to yield sentiment information for various languages and domains by transferring knowledge from a source language (usually English). To propagate polarity scores to the target language, these algorithms take as input an initial vocabulary and a bilingual lexicon. In this paper we propose to enrich lexicon information for cross-lingual propagation by inferring the bilingual semantic relationships from an aligned bilingual vector space. This allows us to exploit the underlying text similarities that are not made explicit by the lexicon. The experiments show that our approach outperforms the state-of-the-art propagation method on multilingual datasets.

1 INTRODUCTION
In the last decade, an increasing amount of opinionated data has been recorded in digital form (e.g., reviews, tweets, blogs). This has fostered the joint use of Natural Language Processing (NLP) and Machine Learning (ML) techniques to extract people's opinions, sentiments, emotions, and attitudes from text, i.e., the sentiment analysis (or opinion mining) problem [16].

Recently proposed approaches (e.g., [1, 8, 9]) aim to predict the sentiment polarity of the analyzed text by means of deep learning techniques.
However, Deep Neural Networks (DNNs) require a sufficiently large corpus of labeled data in order to train accurate sentiment predictors [11]. Meeting such a requirement can be challenging when coping with multilingual and cross-domain data. In particular, the majority of annotated text is written in English, whereas only small amounts of data are available for less commonly spoken languages. Furthermore, the sentiment of a text snippet strongly depends on its surrounding context: for example, a word may have different connotations in different domains. Hence, tailoring DNN models to the right domain and language is crucial for developing accurate and portable sentiment analyzers.

A promising strategy to overcome the lack of multilingual training data has recently been proposed by [9]. They propose an approach to propagate sentiment information, encoded into high-dimensional embedding vectors [17], across languages. The idea is to consider an initial vocabulary for which sentiment embeddings are known (usually in English) and a lexicon that maps English words to those in the target language. The mapping indicates the semantic relationship between pairs of words. First, the word-level sentiment polarity in various domains is extracted in the source language using a supervised transfer learning process. Then, the vector scores for the target language are induced using stochastic gradient descent. A more detailed description of [9] is given in Section 3.

Challenge. The quality of the sentiment score propagation strongly depends on the richness of the bilingual lexicon. When a bilingual lexicon is either not available or partly incomplete, the induction phase is unable to effectively propagate the sentiment polarity scores from the original language to the target one.

Research goals. The goal of this work is to improve the quality of the sentiment propagation phase across languages. The key idea is to enrich lexicon information in the propagation phase by deriving the semantic links among word pairs from an aligned bilingual vector space. This allows us to exploit the underlying text similarities that are not made explicit in the bilingual lexicon. We use an established model for the vector representation of words, i.e., fastText [4]. A deep learning approach to generate aligned fastText word vectors has recently been proposed [13]. Once trained, the bilingual vector spaces not only embed lexicon information but also allow us to derive non-trivial semantic text relationships directly from the latent space. This simplifies the procedure of cross-lingual induction and exploits the vector representation of text in the latent space to infer missing word relationships. The authors of [13] have also published pre-trained aligned vectors for a large number of languages. Hence, a promptly usable, general-purpose vector representation of text is readily available.

Approach. To propagate the multi-domain sentiment polarity scores of a word in the original language (e.g., English), we explore the bilingual aligned vector space. Specifically, an arbitrary word in the original vocabulary is described by two high-dimensional vectors: a latent vector in the original embedding space and a sentiment vector describing the sentiment polarity of the word in different domains. Thanks to the bilingual model, we project each word from the original word embedding to the vector space of the target language and look for its nearest neighbors. Neighbors are likely to be semantically related to the original word (regardless of the presence of an explicit link in the bilingual lexicon). The semantic links and the similarity scores between the projected word and its neighbors are used to drive the sentiment propagation phase towards the target language.

Achievements. The proposed approach produces sentiment embeddings that outperform the state-of-the-art embeddings by [9] on various multilingual benchmark datasets (e.g., using the embeddings with an SVM classifier, it achieves a 10% average macro-F1 score improvement on the Italian datasets).

Paper outline. Section 2 overviews the related literature. Section 3 summarizes the cross-lingual propagation method presented by [9] and introduces the mathematical notation used throughout the paper. Section 4 presents the proposed approach. Section 5 reports the outcomes of the empirical evaluation. Finally, Section 6 draws conclusions and presents future works.
2 RELATED WORK
To predict the sentiment polarity of textual reviews, news, and posts, several deep learning-based sentiment analysis approaches have been proposed. Most of them (e.g., [2, 12, 18]) are language- or domain-specific, i.e., they are specifically tailored to a given context (e.g., movie reviews, Twitter posts) and language. Hence, model learning assumes that a large enough training set is available. Unfortunately, in many real contexts and for various languages this is not the case.

To extend the applicability of existing sentiment analysis solutions towards other languages, the use of automated machine translation tools has been investigated [3, 7, 10, 20]. The main drawbacks of automatic text translation tools are that the process is computationally intensive, the generated translations are prone to errors, and the tools often miss word semantic differences that depend on the context of use [10]. Parallel strategies entail (i) building sentiment lexicons tailored to different languages and domains of interest and exploiting them to train supervised models [5], and (ii) integrating syntax-based rules into unsupervised models [19]. However, all the aforementioned approaches require a significant human effort, which has been undertaken only for the major languages and the most popular domains.

To propagate sentiment information across different languages and domains, a deep learning approach has recently been presented by Dong and De Melo [9]. They consider an initial vocabulary of English words for which sentiment embeddings are known, and a translation lexicon representing semantic relationships between pairs of words (both non-English and English) such as translations, synonyms, orthographic variants, and other semantic, morphological, and etymological word relationships. In [9], links between words are extracted from a multilingual Wiktionary dump [6]. However, for the languages and domains not yet supported by the Etymological Wordnet project, the method remains unable to propagate sentiment information. Furthermore, some relevant semantic links could be missing in the input lexicon. Our proposal is to rely on an aligned bilingual vector space, e.g., [13], from which both explicit and implicit semantic relationships among words can be inferred.

3 SENTIMENT PROPAGATION BASED ON TRANSLATION LEXICON
To propagate sentiment information to various languages, the approach proposed by [9] generates a sentiment embedding vector v_x for each word x in a multilingual vocabulary V. A sentiment embedding is a high-dimensional vector reflecting the distribution of the word's sentiment polarities across a large range of domains (the vectors used in the experiments have 26 dimensions, one for each domain plus an extra dimension combining all domains together). If the same word is used in multiple languages, each instance is treated as a distinct word in V.

The sentiment vector associated with each word x in the initial vocabulary V_0 is derived via transfer learning. Specifically, for each domain d_j a linear Support Vector Machine classifier [15] M_j(x) = w_j · x + b is trained on a set of domain-specific textual documents. The classifier assigns a polarity to each word, denoting whether the word is peculiar to the domain under analysis. The j-th component of the embedding vector v_x incorporates the coefficient w_j of the linear model M_j. Hereafter, we will assume sentiment information to be known for an initial vocabulary V_0 ⊂ V, which usually consists of a subset of English words (i.e., the most popular language used in electronic documents).
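To make the transfer-learning step concrete, the following minimal sketch (not the authors' code; all variable and helper names are hypothetical) trains one linear SVM per domain on a bag-of-words representation and reads each word's domain polarity off the corresponding model coefficient:

    # Illustrative sketch of the per-domain transfer-learning step.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    def domain_polarity(docs, labels):
        """Train M_j on one domain; return a {word: coefficient} map."""
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(docs)        # documents x words
        clf = LinearSVC().fit(X, labels)          # labels in {-1, +1}
        return dict(zip(vectorizer.get_feature_names_out(),
                        clf.coef_.ravel()))

    def sentiment_vectors(domains, vocabulary):
        """domains: list of (docs, labels) pairs, one per domain d_j.
        The j-th component of v_x is the coefficient of x in M_j
        (0 if x does not occur in that domain's corpus)."""
        per_domain = [domain_polarity(d, l) for d, l in domains]
        return {x: np.array([pd.get(x, 0.0) for pd in per_domain])
                for x in vocabulary}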
The translation lexicon TL is a set of triples {(x_1, x'_1, we_1), ..., (x_m, x'_m, we_m)}, with x_1, x'_1, ..., x_m, x'_m ∈ V, providing evidence of the semantic relationships holding between pairs of words in the multilingual vocabulary. The lexicon maps words in the original language to their corresponding translations. Notice that each word may have multiple translations. To incorporate relationships such as synonyms, orthographic variants, and etymological connections, the lexicon also includes links between pairs of words of the same language. In [9] the translation lexicon is extracted from a multilingual Wiktionary dump [6]; the links between pairs of words represent translations, synonyms, morphology, derivation, and etymological links. The weight associated with a triple (x_q, x'_q, we_q), 1 ≤ q ≤ m, denotes the relevance of the semantic relationship. In [6] it indicates the number of semantic links occurring in the data source.

The sentiment vectors of all the words in V are populated by propagating the cross-domain polarity scores of the words in V_0 via an iterative optimization approach, i.e., Stochastic Gradient Descent (SGD). The optimization problem addressed by the SGD entails assigning values to the sentiment vector v_x of every word x in the multilingual vocabulary V according to the following objective function:

$$C \cdot \sum_{x \in V_0} \lVert v_x - \tilde{v}_x \rVert^2 \;-\; \sum_{x \in V} \left[ \frac{1}{\sum_{(x, x', we) \in TL} we} \sum_{(x, x', we) \in TL} we \; v_x^\top v_{x'} \right]$$

where, given a word x ∈ V_0, ṽ_x represents its initial sentiment vector (learned through transfer learning).

The first term of the loss function ensures that the sentiment vectors of the words in the initial vocabulary V_0 do not diverge significantly from the original ones, for a large enough constant C. The second term guarantees that the inferred sentiment vectors of words that are linked together (to some extent) in the translation lexicon are kept similar. A drawback of this loss function is that the dot product in the second term allows arbitrarily large magnitudes for the inferred sentiment vectors: the dot product can grow by indefinitely increasing the magnitude of the vectors being learned. This issue is addressed by the proposed approach.
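The magnitude issue can be checked numerically. In the toy example below (arbitrary illustrative vectors), scaling a linked pair of vectors by a factor s makes the dot-product term of the loss s² times more negative, so the objective can be driven down indefinitely without learning anything about sentiment:

    import numpy as np

    v_x, v_y = np.array([1.0, -2.0]), np.array([0.5, -1.0])
    for s in (1, 10, 100):
        print(s, -(s * v_x) @ (s * v_y))   # second term of the loss in [9]
    # 1 -2.5
    # 10 -250.0
    # 100 -25000.0  -> the loss decreases without bound as the scale s grows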
4 PROPOSED APPROACH
Instead of propagating sentiment information directly by means of the translation lexicon, we aim to link semantically related words indirectly, according to their similarity in a bilingual word embedding space. The idea is to improve the quality of the cross-lingual propagation phase by projecting polarity scores extracted from a richer text representation based on latent spaces.

4.1 Bilingual embedding space
Each word in a dictionary is mapped to a vector in the latent space. The application of word embeddings to many natural language processing tasks is well established. A pioneering work in this field is the Word2Vec model [17]. fastText [4] is a well-known extension of Word2Vec that provides a more effective vector representation by incorporating sub-words into the input dictionary. The vectors associated with the sub-words can be conveniently combined to generate the embeddings of new words that are not present in the dictionary.

Vector representations of text are generated, separately for each language, using deep learning architectures. However, pre-trained vector models (learned from the Wikipedia corpus) are also available for a large number of languages (https://fasttext.cc/). To link words in different languages, the per-language models first need to be aligned. The procedure to align bilingual fastText vector spaces is thoroughly described in [13]. Notably, a large number of pre-trained aligned models is available (https://fasttext.cc/docs/en/aligned-vectors.html). This allows users to exploit general-purpose, multilingual vectors (with 300 dimensions, trained on Wikipedia for 44 languages) without the need to retrain them from scratch.
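As a usage sketch (assuming the gensim library is available; the file names are those published on the fastText aligned-vectors page), the aligned spaces can be queried directly across languages, since the published files are plain word2vec-format text:

    from gensim.models import KeyedVectors

    # Aligned vectors from https://fasttext.cc/docs/en/aligned-vectors.html
    en = KeyedVectors.load_word2vec_format("wiki.en.align.vec")
    it = KeyedVectors.load_word2vec_format("wiki.it.align.vec")

    # Because the spaces are aligned, an English vector can be compared
    # directly against Italian vectors to retrieve its K nearest neighbors.
    neighbors = it.similar_by_vector(en["excellent"], topn=5)
    print(neighbors)   # e.g., eccellente, ottimo, ... (cf. Table 1)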
4.2 Sentiment propagation strategy
Let E_O be the fastText embedding space in the original language (e.g., English) and let E_T be the aligned embedding space in the target language (i.e., a language other than English). Each word x in the original language has a corresponding vector v_x^{e_o} in E_O. Thanks to the aligned bilingual vector space, we can project v_x^{e_o} into the target vector space to obtain the corresponding target vector v_x^{e_t} in E_T. Notice that the new vector does not necessarily correspond to any real word in the target language.

We exploit word similarities in the latent space to propagate sentiment information. Specifically, let v_x^s be the sentiment vector of an arbitrary word x ∈ V_0. We aim to propagate sentiment information to the other words in V \ V_0. This issue is addressed in two steps: (i) first, we create a word graph representing the most significant pairwise word similarities; (ii) next, we propagate the sentiment scores to the other words using gradient descent. Unlike [9], we adopt a new loss function tailored to the problem under analysis. Notice that step (i) allows us to obtain a richer word representation compared to a bilingual lexicon.

Word Graph Creation. The word graph G = (V, E) is an undirected weighted graph connecting pairs of words in V. Edges in E are triples (x, x', w_xx'), where x, x' ∈ V are the connected vertices and w_xx' is the edge weight. For each word x ∈ V we explore the neighborhood of vector v_x^{e_t} in the target latent space to look for the neighbor words that are most semantically related to x. More specifically, we look for the K nearest vectors (where K is a user-specified parameter) corresponding to the words of the target language that are closest to v_x^{e_t}, and select these words. Given the set NN_x of x's nearest neighbors, we create a weighted edge e ∈ E connecting every x' ∈ NN_x to x. The weight of the edge connecting words x and x' indicates the pairwise word similarity in the latent space and is computed using the cosine similarity [15]. To avoid introducing unreliable word relationships and to limit graph connectedness, we filter out the edges (links) with weight below a given (user-specified) threshold α. The effect of the parameters K and α on the performance and complexity of the proposed approach is discussed in Section 5; a minimal implementation sketch of this construction step is given after the example below.

Example. Suppose that the original language is English and the target language is Italian. Let us consider the following input parameters: K = 5 and α = 0.4. Table 1 and Figure 1 report an example related to the English word excellent. Specifically, Table 1 reports the five nearest neighbors of excellent, while Figure 1 shows the word sub-graph associated with that word. Only the first two neighbors of excellent have a cosine similarity greater than or equal to 0.4. Hence, only the Italian words eccellente and ottimo are connected to excellent in the word graph G. This is semantically correct, because eccellente is the Italian translation of excellent and ottimo has a similar meaning. The three discarded neighbors are other "positive" adjectives, but they do not have the same meaning as excellent (their translations are appreciable, good, and suitable, respectively). Hence, the enforcement of the minimum similarity threshold helps us remove noisy connections.

Table 1: Example: K nearest neighbors of excellent. Original language = English, target language = Italian, K = 5

    x          x's nearest neighbors   Cosine similarity
    excellent  eccellente              0.575
               ottimo                  0.513
               apprezzabile            0.369
               buon                    0.367
               adatto                  0.322

[Figure 1: Example: word sub-graph associated with excellent. Original language = English, target language = Italian, K = 5, α = 0.4. Only the edges excellent-eccellente (0.575) and excellent-ottimo (0.513) survive the α threshold.]
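A minimal sketch of the graph-construction step (hypothetical variable names; `source_vecs` and `target_vecs` are assumed to be the aligned embedding matrices), using scikit-learn's nearest-neighbor search with the cosine metric:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def build_word_graph(source_words, source_vecs,
                         target_words, target_vecs, K=5, alpha=0.4):
        """Return the edge set E as (x, x', w_xx') triples."""
        nn = NearestNeighbors(n_neighbors=K, metric="cosine").fit(target_vecs)
        dist, idx = nn.kneighbors(source_vecs)  # cosine distance = 1 - similarity
        edges = []
        for i, x in enumerate(source_words):
            for d, j in zip(dist[i], idx[i]):
                similarity = 1.0 - d
                if similarity >= alpha:          # prune unreliable links
                    edges.append((x, target_words[j], similarity))
        return edges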
The English word excellent, which is one of the words in V_0, is characterized by a sentiment embedding (i.e., a vector of cross-domain polarity scores). The sentiment vectors of the Italian words are populated by propagating the cross-domain polarity scores of the English words via the iterative optimization approach described in the following.

Gradient Descent with Updated Loss Function. Gradient descent is used to propagate sentiment information through the word graph. As discussed in Section 3, the iterative propagation process should both preserve the values of the vectors in the initial vocabulary V_0 and guarantee a high degree of similarity between the sentiment vectors of linked words. To achieve these goals, we adopt the following objective function:

$$C \cdot \sum_{x \in V_0} \lVert v_x - \tilde{v}_x \rVert^2 \;+\; \sum_{x \in V} \left\lVert v_x - \frac{1}{\sum_{(x, x', w_{xx'}) \in E} w_{xx'}} \sum_{(x, x', w_{xx'}) \in E} w_{xx'} \, v_{x'} \right\rVert_2^2$$

where, for each word x in the initial vocabulary V_0, the first term minimizes the deviation from its initial sentiment embedding vector ṽ_x. The second term minimizes the deviation from the sentiment vectors of its neighbors, represented as connected words in the word graph. Adopting the L2-norm in the second term allows the propagation of the vector dimensions without altering the vector magnitude. Therefore, words in the initial vocabulary keep, to a good approximation, their original vectors, whereas new words get sentiment polarity scores similar to those of their neighbors in the target latent space.
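The propagation can be sketched as a full-batch gradient-descent variant of the optimization (a simplification of the iterative SGD; the names and the row-normalized adjacency representation are our assumptions, not the authors' implementation). With W the row-normalized weighted adjacency matrix of G, the second term of the loss is $\lVert V - WV \rVert_F^2$, whose gradient is $2(I - W)^\top (I - W)V$:

    import numpy as np

    def propagate(V, V0_mask, V0_init, W, C=10.0, lr=0.01, n_iter=500):
        """Gradient descent on C * sum_{x in V0} ||v_x - v~_x||^2
                               +     sum_{x in V} ||v_x - (W V)_x||^2.
        V:       (n_words, n_domains) sentiment matrix (unknown rows start at 0).
        V0_mask: boolean mask of the initial-vocabulary rows.
        V0_init: the fixed transfer-learning vectors for those rows.
        W:       row-normalized adjacency matrix (dense or scipy.sparse)."""
        for _ in range(n_iter):
            resid = V - W @ V                    # deviation from neighbor average
            grad = 2.0 * (resid - W.T @ resid)   # gradient of ||(I - W) V||^2
            grad[V0_mask] += 2.0 * C * (V[V0_mask] - V0_init)
            V = V - lr * grad
        return V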
5 EXPERIMENTS
The experiments presented in this section are aimed at evaluating the quality of the sentiment vectors resulting from the application of the proposed methodology. The evaluation is formulated as a binary sentiment analysis problem. The sentiment embeddings are compared, in terms of macro-F1 score, with those produced by the method presented in [9].

All the experiments were run on a machine equipped with an Intel® Xeon® X5650 CPU and 32 GB of RAM, running Ubuntu 18.04.1 LTS.

The rest of this section is organized as follows. Subsection 5.1 describes the settings used in the experimental validation as well as the analyzed datasets. Subsection 5.2 summarizes the main results. Subsection 5.3 discusses the influence of the main parameters on the performance of the proposed approach. Finally, Subsection 5.4 analyzes the spatial complexity of the proposed approach.

5.1 Experimental setting
To validate the quality of the generated sentiment vectors, we set up a binary sentiment classification task over multilingual datasets. Specifically, given a set of short text snippets labelled as positive or negative according to their sentiment polarity, we aim at predicting the sentiment polarities of a subset of related snippets for which the polarities are assumed to be unknown.

To accomplish the classification task, we train two popular classification models, i.e., the Support Vector Machine (SVM) and the Random Forest (RF) classifiers [15]. Classifiers are first trained separately on each multilingual training dataset and then applied to the corresponding test set. More specifically, each dataset is split into a training set (80% of the data), used for training the models and tuning the hyper-parameters, and a test set (20%), used for performance evaluation. Classifier settings are chosen according to the outcomes of a grid search based on 5-fold cross-validation. Separately for each language and dataset, we evaluate the performance of each classification model in terms of macro-F1 score. The F1 score is a popular metric that indicates the harmonic mean of the precision and recall of the generated model [15]. Unlike the traditional (micro-averaged) F1 score, the macro-F1 score averages the per-class F1 scores, giving each class equal weight. Hence, the metric is deemed more suitable for evaluating imbalanced datasets, i.e., datasets for which the class labels are unevenly distributed in the training data.

Classifiers are trained on a vector representation of the input text snippets. The vector associated with each snippet is computed by averaging, dimension by dimension, the sentiment vectors of the words included in the snippet (a sketch of the evaluation pipeline is given at the end of this subsection). Notice that the aforesaid task could alternatively be addressed using Recurrent Neural Networks or Convolutional Neural Networks [14]; the comparison between different deep learning techniques is out of the scope of the present work.

The classification performance achieved on the sentiment vectors produced by our method is compared with that achieved on the vectors produced by [9]. The comparison is aimed at showing the higher effectiveness of the proposed approach compared to state-of-the-art solutions.

Data. The list of datasets used for the experiments comprises all the datasets adopted by [9] plus two larger external datasets with different data distributions. Table 2 summarizes the characteristics of the existing datasets.

The datasets provided by [9] were extracted from several websites and collect the reviews left by users on specific topics (e.g., places, movies, food). The target binary class (positive or negative) is derived from the user rating. However, user ratings are not necessarily binary values (they usually comply with the 5-star system). To generate the binary sentiment polarities, we have discretized the 5-star ratings as follows: reviews with 3 stars are discarded (as they are considered neutral), 1- and 2-star reviews are assigned to the negative class, and 4- and 5-star reviews to the positive one.

Table 2: Cardinality and class distribution for each of the datasets presented in [9]

    Dataset  Cardinality  #Positive  #Negative
    cs       2,458        1,660      798
    de       2,407        1,839      568
    es       2,951        2,367      584
    fr       3,912        2,080      1,832
    it       3,559        2,867      692
    nl       1,892        1,232      660
    ru       3,414        2,500      914

The statistics reported in Table 2 clearly show a strong class imbalance in the analyzed data. This may hinder the training of robust classifiers, as the minority class may not be sufficiently represented by the trained models. To evaluate the performance of the proposed approach on more balanced data as well, we have considered two additional datasets for the Italian language (i.e., the language for which the imbalance ratio of the corresponding dataset is maximal). Table 3 describes the two new Italian datasets. The data was extracted from reviews left by TripAdvisor (https://www.tripadvisor.com) users in different Italian cities.

Table 3: Key statistics for the new Italian datasets

    Dataset  Cardinality  #Positive  #Negative
    IT1      10,024       5,012      5,012
    IT2      13,888       6,942      6,946
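The evaluation pipeline described above can be sketched as follows (hypothetical names; `sent_vecs` maps each word to its 26-dimensional sentiment vector; the grid-search tuning step is omitted for brevity):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import f1_score

    def snippet_vector(snippet, sent_vecs, dim=26):
        """Average the sentiment vectors of the words in a snippet."""
        vecs = [sent_vecs[w] for w in snippet.lower().split() if w in sent_vecs]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def evaluate(snippets, labels, sent_vecs):
        X = np.vstack([snippet_vector(s, sent_vecs) for s in snippets])
        X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2)
        clf = SVC().fit(X_tr, y_tr)   # RandomForestClassifier used analogously
        return f1_score(y_te, clf.predict(X_te), average="macro")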
5.2 Performance comparison
Table 4 summarizes the results obtained on the various datasets. For each dataset, the performance of SVM and RF is reported both for the proposed methodology (denoted as Our method) and for the sentiment embeddings produced by [9] (denoted as Dong and De Melo). The outcomes of the proposed methodology were achieved by setting K to 5 and α to 0.4; Subsection 5.3 discusses the effect of the input parameters on the performance of the proposed method.

Table 4: Comparison, in terms of macro-F1 score, between the embeddings produced by the proposed methodology (Our method) and those generated by [9] (Dong and De Melo)

               Our method        Dong and De Melo
    Dataset    SVM      RF       SVM      RF
    cs         0.7403   0.7198   0.7227   0.7297
    de         0.6847   0.6981   0.6495   0.6756
    es         0.6131   0.531    0.4451   0.4892
    fr         0.7021   0.7291   0.6389   0.6764
    it         0.8256   0.794    0.6805   0.6644
    nl         0.6869   0.6369   0.5903   0.6022
    ru         0.6840   0.6112   0.7221   0.7009
    IT1        0.8439   0.8424   0.7435   0.7311
    IT2        0.8441   0.8427   0.7415   0.7494

The proposed methodology for cross-lingual sentiment propagation performs better than the method proposed by [9] in terms of macro-F1 score on the majority of the analyzed datasets (the Russian reviews being the only exception).

To gain insight into the classifiers' performance, we also explore the ability of the classifiers to correctly assign each class label (i.e., precision) as well as to retrieve the largest extent of the test samples labeled with each class (i.e., recall). Table 5 reports the macro-precision and macro-recall values (indicating the means of the per-class precision and recall values, respectively) [15]. Based on the achieved results, we can conclude that classifier performance is not biased towards either of the aforesaid metrics. Interestingly, the embeddings produced by [9] show higher precision for multiple languages, but their recall is often worse than that achieved by the proposed method (which relies on the unified latent space model).

Table 5: Results in terms of macro-precision and macro-recall, for embeddings generated by the proposed methodology (Our method) and those introduced in [9] (Dong and De Melo)

                          Our method        Dong and De Melo
    Dataset  Metric       SVM      RF       SVM      RF
    cs       Precision    0.7347   0.7326   0.7203   0.7547
             Recall       0.7593   0.712    0.7474   0.7177
    de       Precision    0.6797   0.7481   0.6563   0.7735
             Recall       0.7372   0.6766   0.7131   0.6507
    es       Precision    0.6111   0.7747   0.4010   0.6181
             Recall       0.6154   0.5428   0.5      0.5172
    fr       Precision    0.7025   0.7301   0.6488   0.6784
             Recall       0.7019   0.7309   0.6403   0.6760
    it       Precision    0.8494   0.8168   0.6750   0.8030
             Recall       0.8071   0.7765   0.7637   0.6336
    nl       Precision    0.6868   0.6651   0.6059   0.6491
             Recall       0.704    0.6317   0.6162   0.6022
    ru       Precision    0.6805   0.6623   0.7151   0.7362
             Recall       0.7221   0.6025   0.7634   0.6845
    IT1      Precision    0.8441   0.8425   0.7442   0.7314
             Recall       0.8439   0.8424   0.7436   0.7312
    IT2      Precision    0.8442   0.8428   0.7416   0.7495
             Recall       0.8441   0.8427   0.7415   0.7495

5.3 Parameter analysis
We also study the effect of different values of the parameters K and α on the quality of the generated embeddings. To do so, we separately analyze their impact on the macro-F1 scores achieved by the binary classifiers. Hereafter, for the sake of brevity, we report only the results achieved on a representative dataset (IT2), which is the largest and most balanced among all the tested ones. Similar results were achieved on the other datasets.

The parameter K indicates the number of neighbors considered while linking the words in the original language to those in the target language. The higher K, the more word relationships are included in the word graph. As a drawback, when K is relatively high, the model may include less relevant or unreliable links. Furthermore, since the connectedness of the graph increases, the complexity of the sentiment propagation process gets worse (see Subsection 5.4). Figure 2 shows how the macro-F1 score varies as K increases, for α = 0.4. The plot highlights a knee in the curve at K = 5: for the purpose of sentiment classification, using a larger value of K does not yield significant performance improvements. Notice that, to remove the less reliable links, the word graph is pruned early by enforcing the cut-off threshold α; the impact of the pruning phase is higher when high K values are set.

[Figure 2: Macro-F1 score as a function of K (for K = 1 to 21), on dataset IT2. Both the SVM and RF curves flatten beyond K = 5, with macro-F1 scores in the 0.79-0.85 range.]

We also separately analyze the impact of the parameter α. Enforcing low α values potentially introduces a bias in the graph due to the presence of "noisy" links, whereas setting high α values limits the word graph connectedness. Given an edge (x, x', w_xx') ∈ E, the edge weight w_xx' is computed as the cosine similarity between x and x' [15]. The cosine similarity takes (absolute) values between 0 (orthogonal vectors) and 1 (parallel vectors); hence, α has the same value range. Figure 3 shows how the macro-F1 score varies as α increases. The lower bound set for α (approximately 0.4) is the best we can manage with the hardware resources currently in use, since setting lower values requires more computational memory; Subsection 5.4 provides a more detailed analysis of the space complexity of the problem. Specifically, the empirical results show that, by setting α to 0.4, the sentiment propagation process converges to a satisfactory solution with the hardware resources available for this study. Setting lower α values (i.e., limited graph pruning) yields sentiment embeddings of higher quality. Conversely, when high α values are set (specifically, for values of α ≥ 0.7), the pruning phase is not beneficial.

[Figure 3: Macro-F1 score as a function of α (for α between 0.4 and 0.9), on dataset IT2, for both SVM and RF.]
5.4 Complexity analysis
The most computationally intensive step of the proposed method is the sentiment propagation on the word graph based on Stochastic Gradient Descent. It entails computing the gradient of the loss function described in Section 4 and then iteratively updating the sentiment polarities until a local minimum is reached.

To exploit the hardware optimizations available for matrix computations, the gradient can be computed on the entire matrix rather than separately for each weight. Specifically, to process the information embedded in the word graph, an adjacency matrix A is defined. Each matrix value A_ij indicates the weight of the edge linking two arbitrary words x_i and x_j; if the edge does not exist, the corresponding matrix value is zero. Since graph connectedness is bounded by the cut-off threshold α, the adjacency matrix is rather sparse (the higher α, the sparser A). To compute the gradient of the adopted loss function, the adjacency matrix of the word graph is loaded into main memory. Large sparse matrices can be efficiently loaded into main memory by storing only the non-zero elements. However, this limits the types of operations that can be performed; hence, in most cases, a denser in-memory representation is needed.

Let N be the size of the initial vocabulary V_0 in the original language (English, in our case). For each word in V_0, K neighbor words are selected from the target language, so in the worst case the word graph contains (K + 1)N words. Since part of the neighbors in the target language overlap, we can assume, to a good approximation, that each word in the original language has one translation in the target language, yielding 2N words in the resulting graph. The corresponding adjacency matrix consists of 4N² cells. Assuming B bytes are used to represent each floating point number (where B = 4 or B = 8 in modern systems), the total adjacency matrix size is 4BN². Since the size N of the initial English vocabulary ranges between 80,000 and 100,000, the required memory allocation ranges between 100 and 300 GB (see the worked check at the end of this subsection).

A possible way to optimize the process is to identify connected sub-graphs and to run the Stochastic Gradient Descent separately on each sub-graph. This is feasible because nodes and edges external to a connected sub-graph do not influence sentiment propagation within it. This reduces the size of the processed adjacency matrices, which are stored in main memory. However, when graphs are highly connected (as in our case), the optimization is not very beneficial. Therefore, as discussed in Subsection 5.3, in the experimental evaluation reported in this study we have decided to limit the computational complexity of the propagation process by properly setting the K and α parameters.
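The memory figures above follow directly from the 4BN² estimate (a worked check, under the paper's own assumptions of 2N words and dense B-byte floats; the 100-300 GB range is matched approximately):

    # Dense adjacency over ~2N words: (2N)^2 = 4N^2 floats of B bytes each.
    for N in (80_000, 100_000):
        for B in (4, 8):
            print(f"N={N:,}, B={B}: {4 * B * N**2 / 1e9:.0f} GB")
    # N=80,000,  B=4: 102 GB    N=80,000,  B=8: 205 GB
    # N=100,000, B=4: 160 GB    N=100,000, B=8: 320 GB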
6 CONCLUSIONS AND FUTURE WORKS
This paper presents an in-progress research study on the use of a bilingual latent space to propagate sentiment information across multiple languages. The proposed approach overcomes the limitations of previously proposed solutions, which stem from the dependence of the propagation phase on the bilingual lexicon. Our claim is that relying on latent word relationships (which embed lexicon information as well) enhances the process of sentiment propagation in cross-lingual and multi-domain contexts.

We have empirically compared the sentiment embeddings generated by the proposed methodology with those produced by the approach presented in [9]. Specifically, the embeddings have been exploited to tackle a binary sentiment analysis problem. The results confirm the initial claim: for most of the considered languages, the propagated information yields better results.

The presented study leaves room for several extensions. Firstly (and most importantly), we aim at extending the deep learning process (based on a dual-channel CNN) presented by [9] by embedding the enhanced sentiment vector propagation phase. This would allow us to fully explore the potential of the new methodology within a state-of-the-art deep neural network architecture for sentiment analysis.

A further exploration will be devoted to identifying the optimal setting of the α parameter. We plan not only to increase the computational power, but also to study more sophisticated strategies to optimize the propagation phase, as well as to design greedy strategies able to overcome the limitations due to the iterative optimization process.

Finally, we plan to test further multilingual datasets. Since most of the publicly available datasets are small- or medium-sized and quite imbalanced, we aim at crawling, releasing, and testing new data related to various domains and written in different languages.

7 ACKNOWLEDGEMENTS
This work has been partially supported by the SmartData@PoliTO center on Big Data and Data Science.

REFERENCES
[1] Oscar Araque, Ignacio Corcuera-Platas, J. Fernando Sánchez-Rada, and Carlos A. Iglesias. 2017. Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Systems with Applications 77 (2017), 236–246. https://doi.org/10.1016/j.eswa.2017.02.002
[2] Oscar Araque, Ignacio Corcuera-Platas, J. Fernando Sánchez-Rada, and Carlos A. Iglesias. 2017. Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Systems with Applications 77 (2017), 236–246. https://doi.org/10.1016/j.eswa.2017.02.002
[3] Carmen Banea, Rada Mihalcea, Janyce Wiebe, and Samer Hassan. 2008. Multilingual subjectivity analysis using machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Honolulu, Hawaii, USA, 127–135.
[4] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[5] Yanqing Chen and Steven Skiena. 2014. Building sentiment lexicons for all major languages. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Baltimore, Maryland, USA, 383–389.
[6] Gerard de Melo. 2014. Etymological Wordnet: Tracing The History of Words. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). European Language Resources Association (ELRA), Reykjavik, Iceland, 1148–1154. http://www.lrec-conf.org/proceedings/lrec2014/pdf/1083_Paper.pdf
[7] Erkin Demirtas and Mykola Pechenizkiy. 2013. Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining. ACM, Chicago, USA, 9.
[8] Hai Ha Do, PWC Prasad, Angelika Maag, and Abeer Alsadoon. 2019. Deep Learning for Aspect-Based Sentiment Analysis: A Comparative Review. Expert Systems with Applications 118 (2019), 272–299. https://doi.org/10.1016/j.eswa.2018.10.003
[9] Xin Dong and Gerard de Melo. 2018. Cross-Lingual Propagation for Deep Sentiment Analysis. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). AAAI Press, New Orleans, Louisiana, USA, 5771–5778.
[10] Kevin Duh, Akinori Fujino, and Masaaki Nagata. 2011. Is machine translation ripe for cross-lingual sentiment classification?. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2. Association for Computational Linguistics, Portland, Oregon, USA, 429–433.
[11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org
[12] Hitkul Jangid, Shivangi Singhal, Rajiv Ratn Shah, and Roger Zimmermann. 2018. Aspect-Based Financial Sentiment Analysis Using Deep Learning. In Companion Proceedings of The Web Conference 2018 (WWW '18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 1961–1966. https://doi.org/10.1145/3184558.3191827
[13] Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave. 2018. Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
[14] Ji Young Lee and Franck Dernoncourt. 2016. Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, 515–520. https://doi.org/10.18653/v1/N16-1062
[15] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets (2nd ed.). Cambridge University Press, New York, NY, USA.
[16] Bing Liu. 2015. Sentiment Analysis - Mining Opinions, Sentiments, and Emotions. Cambridge University Press.
[17] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic Regularities in Continuous Space Word Representations. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics (NAACL-HLT 2013). Atlanta, Georgia, USA, 746–751. https://www.aclweb.org/anthology/N13-1090/
[18] Aliaksei Severyn and Alessandro Moschitti. 2015. Twitter Sentiment Analysis with Deep Convolutional Neural Networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15). Association for Computing Machinery, New York, NY, USA, 959–962. https://doi.org/10.1145/2766462.2767830
[19] David Vilares, Carlos Gómez-Rodríguez, and Miguel A. Alonso. 2017. Universal, unsupervised (rule-based), uncovered sentiment analysis. Knowledge-Based Systems 118 (2017), 45–55. https://doi.org/10.1016/j.knosys.2016.11.014
[20] Xiaojun Wan. 2009. Co-Training for Cross-Lingual Sentiment Classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Association for Computational Linguistics, Suntec, Singapore, 235–243. https://www.aclweb.org/anthology/P09-1027