Comparative Experiments for Multilingual Sentiment Analysis Using Machine Translation

Alexandra Balahur and Marco Turchi
alexandra.balahur@jrc.ec.europa.eu, marco.turchi@jrc.ec.europa.eu
European Commission Joint Research Centre
IPSC, GlobeSec, OPTIMA
Via E. Fermi 2749, Ispra, Italy

Abstract. Sentiment analysis is the Natural Language Processing (NLP) task dealing with the detection and classification of sentiment in text. Given the importance of user-generated content on the recent Social Web, this task has received much attention from the NLP research community in recent years. Sentiment analysis has been studied for different types of text and in the context of distinct domains. However, only a small part of the research has concentrated on sentiment analysis for languages other than English, which most of the time lack lexical resources or have very few. In this context, the present article proposes and evaluates the use of machine translation and supervised methods to deal with sentiment analysis in a multilingual setting. Our extensive evaluation scenarios for German, Spanish and French, using three different machine translation systems and various supervised algorithms, show that SMT systems can start to be employed to obtain good quality data for other languages. Subsequently, this data can be employed to train classifiers for sentiment analysis in these languages, reaching performances close to those obtained for English.

1 Introduction

During the past years, the content generated by users on the Web, in the form of comments and statements of opinion in forums, blogs, review sites and microblogs, has become more and more important. Its high volume and unbiased nature, as well as the fact that it is written by people from all social categories all over the world, make such information useful to many domains, such as Economics, Social Science, Political Science and Marketing, to mention just a few. Nevertheless, the high quantity of such data and the high rate at which it is produced require automatic mechanisms to extract valuable knowledge from it. In the case of opinionated data, this need motivated the rapid and steady growth of interest in the Natural Language Processing (NLP) community in developing computational methods to analyze subjectivity and sentiment in text. These tasks have received many names, of which "subjectivity analysis", "sentiment analysis" and "opinion mining" are the most frequently employed. The body of research conducted on these tasks has proposed different methods to deal with subjectivity and sentiment classification in different texts and domains, reaching satisfactory levels of performance for English. However, for certain applications, such as news monitoring, information in languages other than English is also highly relevant and cannot be disregarded, as it represents a high percentage of the relevant data. In such systems, additionally, sentiment analysis tools must be reliable and perform at levels similar to those implemented for English.

In order to overcome the above-mentioned issues, the work presented herein proposes and evaluates different methods for multilingual sentiment analysis using machine translation and supervised methods.
In particular, we study this issue for three languages - French, German and Spanish - using three different Machine Translation systems - Google Translate (http://translate.google.it/), Bing Translator (http://www.microsofttranslator.com/) and Moses [11] - and different machine learning models. To have a more precise measure of the impact of translation quality on this task, we create Gold Standard sets for each of the three languages. Our experiments show that machine translation systems are reaching a reasonable level of maturity so as to be employed for multilingual sentiment analysis and that, for some languages (for which the translation quality is high enough), the performance that can be attained is similar to that of systems implemented for English, in terms of weighted F-measure.

2 Related Work

Most of the research in subjectivity and sentiment analysis has been done for English. However, some authors have developed methods for mapping subjectivity lexicons to other languages. To this aim, [9] use a machine translation system and subsequently apply a subjectivity analysis system developed for English to create subjectivity analysis resources in other languages. [12] propose a method to learn multilingual subjective language via cross-language projections. They use the Opinion Finder lexicon [22] and two bilingual English-Romanian dictionaries to translate the words in the lexicon. Another approach was proposed by Banea et al. [3]. The authors perform three different experiments: translating the annotations of the MPQA corpus, using the automatically translated entries in the Opinion Finder lexicon and, thirdly, validating the data by reversing the direction of translation. In a further approach, Banea et al. [2] apply bootstrapping to build a subjectivity lexicon for Romanian, starting with a set of 60 words which they translate and subsequently filter using a measure of similarity to the original words, based on Latent Semantic Analysis (LSA) [8] scores. Yet another approach to mapping subjectivity lexica to other languages is proposed by Wan (2009), who uses co-training to classify unannotated Chinese reviews using a corpus of annotated English reviews. [10] create a number of systems consisting of different subsystems, each classifying the subjectivity of texts in a different language. They translate a corpus annotated for subjectivity analysis (MPQA) and the subjectivity clues (Opinion Finder) lexicon, and re-train a Naive Bayes classifier implemented in the Opinion Finder system using the newly generated resources for all the languages considered. [4] translate the MPQA corpus into five other languages (some with a similar etymology, others with a very different structure). Subsequently, they expand the feature space used in a Naive Bayes classifier using the same data translated into two or three other languages. Finally, [18, 19] create sentiment dictionaries in other languages using a method called "triangulation". They translate the data, in parallel, from English and Spanish to other languages and obtain dictionaries from the intersection of these two translations.

Machine translation has not been widely used in other natural language processing tasks, due to the poor quality of translated texts, but recent advances in Machine Translation have motivated such attempts.
In Information Retrieval, [17] proposed a comparison between Web searches using monolingual and translated queries. On average, the results show a drop in performance when translated queries are used, but it is quite limited, around 15%. For some language pairs, the average result obtained is around 10% lower than that of a monolingual search, while for other pairs the retrieval performance is clearly lower. In cross-language document summarization, [21, 5] combined the MT quality score with the informativeness score of each sentence in a set of documents to automatically produce a summary in a target language from source-language texts. In [21], each sentence of the source document is ranked according to both scores, the summary is extracted and the selected sentences are then translated into the target language. Differently, in [5], sentences are first translated, then ranked and selected. Both approaches enhance the readability of the generated summaries without degrading their content.

3 Motivation and Contribution

The work presented herein is mainly motivated by the need to develop sentiment analysis tools for a high number of languages, while minimizing the effort of creating linguistic resources for each of these languages separately. Unlike the approaches presented in the Related Work section, we employ fully-fledged machine translation systems. In this context, another novelty of our approach is that we also study the influence that differences in translation performance have on the sentiment classification performance. Additionally, the distinct characteristics of translated data (when compared to original data) may imply that other types of features are more appropriate, which we investigate as well. Moreover, previous approaches have usually employed only simple machine learning algorithms; no attempt has been made to study the use of meta-classifiers to enhance the performance of the classification through the removal of noise in the data.

More specifically, we employ three MT systems - Bing Translator, Google Translate and Moses - to translate data from English to three languages - French, German and Spanish. We create a Gold Standard for all the languages, used both to measure the translation quality and to test the performance of sentiment classification on translated (noisy) versus correct data. These correct translations allow us to have a more precise measure of the impact of translation quality on the sentiment classification task. Another contribution of this article is the study of different types of features that can be employed to build machine learning models for the sentiment classification task. Further on, apart from studying different features that can be used to represent the training data, we also study the use of meta-classifiers to minimize the effect of noise in the data. Our comparative results show, on the one hand, that machine translation can be reliably used for multilingual sentiment analysis and, on the other hand, what the main characteristics of the data must be for such approaches to be successfully employed.

4 Dataset Presentation and Analysis

For our experiments, we employed the data provided for English in the NTCIR 8 Multilingual Opinion Analysis Task (MOAT, http://research.nii.ac.jp/ntcir/ntcir-ws8/permission/ntcir8xinhua-nyt-moat.html). In this task, the organizers provided the participants with a set of 20 topics (questions) and a set of documents in which sentences relevant to these questions could be found, taken from the New York Times Text (2002-2005) corpus.
The documents were given in two different forms, which had to be used depending on the subtask addressed. The first variant contained the documents split into sentences (6165 in total) and had to be used for the tasks of opinionatedness, relevance and answerness. In the second form, the sentences were additionally split into opinion units (6223 in total) for the opinion polarity and the opinion holder and target tasks. For each of the sentences, the participants had to provide judgements on opinionatedness (whether they contained opinions) and relevance (whether they were relevant to the topic). For the task of polarity classification, the participants had to employ the dataset containing the sentences split into opinion units (i.e. one sentence could contain two or more opinions, on two or more different targets or from two or more different opinion holders).

For our experiments, we employed the latter representation. From this set, we randomly chose 600 opinion units to serve as test set; the rest of the opinion units were employed as training set. Subsequently, we employed the Google Translate, Bing Translator and Moses systems to translate both the training set and the test set into French, German and Spanish. Additionally, we employed the Yahoo system (whose performance was the lowest in our initial experiments) to translate only the test set into these three languages. This translation was then corrected manually, for all the languages, and the corrected data serves as Gold Standard. (We translated whole sentences, not opinion units separately, so sentences containing multiple opinion units were translated twice. After duplicate elimination, we were left with 400 sentences in the test and Gold Standard sets and 5700 sentences in the training set.) Most of these sentences, however, contained no opinion (they were neutral). Since the neutral examples are in the majority and can produce a large bias when classifying the polarity of the sentences, we eliminated them and employed only the positive and negative sentences in both the training and the test sets. After this elimination, the training set contains 943 examples (333 positive and 610 negative) and the test set and Gold Standard contain 357 examples (107 positive and 250 negative). Although it would also have been possible to estimate an upper bound for each of the systems by using a Gold Standard for each of the training sets, at this point we considered the scenario that is closer to real situations, in which no training data exists for a specific language.

5 Using Machine Translation for Multilingual Sentiment Analysis

The issue of extracting and classifying sentiment in text has been approached using different methods, depending on the type of text, the domain and the language considered. Broadly speaking, the methods employed can be classified into unsupervised (knowledge-based), supervised and semi-supervised methods. The first usually employ lexica or dictionaries of words with associated polarities (and values, e.g. 1, -1) and a set of rules to compute the final result. The second category of approaches employs statistical methods to learn classification models from training data, based on which the test data is then classified.
Finally, semi-supervised methods employ knowledge-based approaches to classify an initial set of examples, after which they use different machine learning methods to bootstrap new training examples, which they subsequently use with supervised methods.

The main issue with the first type of approach is that obtaining lexica large enough to deal with the variability of language is very expensive (if done manually) and generally not reliable (if done automatically). A further problem of such approaches is that words out of context are highly ambiguous. Semi-supervised approaches, on the other hand, depend strongly on how well the initial set of examples is classified. If we were to employ machine translation, the errors in translating this small initial set would have a high negative impact on the subsequently learned examples. The challenge of using statistical methods is that they require training data (e.g. annotated corpora) and that this data must be reliable (i.e. not contain mistakes or "noise"). The lower the quality of the translation, the sparser the feature vectors employed in the machine learning models will be. However, the larger this dataset is, the less influence the translation errors have.

Since we want to study whether machine translation can be employed to perform sentiment analysis for different languages, we employed statistical methods in our experiments. More specifically, we used Support Vector Machines Sequential Minimal Optimization (SVM SMO), with different types of features (n-grams, presence of sentiment words), since the literature in the field has confirmed it as the best-performing machine learning algorithm for this task [16].

For the purpose of our experiments, three different SMT systems were used to translate the human-annotated sentences: two existing online services, Google Translate and Bing Translator (http://translate.google.com/ and http://www.microsofttranslator.com/), and an instance of the open source phrase-based statistical machine translation toolkit Moses [11], trained on freely available corpora. This results in 2.7 million sentence pairs for English-French, 3.8 million for English-German and 4.1 million for English-Spanish. All the models are optimized by running the MERT algorithm [13] on the development part of the training data. The translated sentences are recased and detokenized (for more details on the system, please see [20]).

6 Experiments

In order to test the performance of sentiment classification when using translated data, we employed supervised learning using Support Vector Machines Sequential Minimal Optimization [14] - SVM SMO - with different features:

– In the first approach, we represented, for each of the languages and translation systems, the sentences as vectors whose features marked the presence/absence (boolean) of the unigrams contained in the corresponding training set (e.g. we obtained the unigrams from all the sentences in the training set produced by translating the English training data to Spanish using Google Translate, and subsequently represented each sentence in this training set, as well as in the test set produced by translating the English test data to Spanish using Google Translate, by marking the presence of these unigram features).
– In the second approach, we represented the training and test sets as in the previous representation, with the difference that the features were computed not as the presence of the unigrams, but as the tf-idf score of each unigram.
– In the third approach, we represented, for each of the languages and translation systems, the sentences as vectors whose features marked the presence/absence of the unigrams and bigrams contained in the corresponding training set.

In our experiments, we also studied the possibility of employing the sentiment-bearing words in the sentences to be classified as features for the machine learning algorithm. To this end, we employed the SentiWordNet, General Inquirer and WordNet Affect dictionaries for English and the multilingual dictionaries created by (Steinberger et al., 2012). The main problem of this approach was, however, that very few features were found, and only for a small number of the sentences to be classified: on the one hand because affect is not expressed in these sentences using lexical clues and, on the other hand, because the dictionaries we had at our disposal for languages other than English were not very large (around 1500 words). For this reason, we do not report these results. Table 1 presents the number of unigram and bigram features employed in each of the cases.

Language  SMT system          Nr. of unigrams  Nr. of bigrams
English   —                   5498             15981
French    Bing                7441             17870
          Google              7540             18448
          Moses               6938             18814
          Bing+Google+Moses   9082             40977
German    Bing                7817             16216
          Google              7900             16078
          Moses               7429             16078
          Bing+Google+Moses   9371             36556
Spanish   Bing                7388             17579
          Google              7803             18895
          Moses               7528             18354
          Bing+Google+Moses   8993             39034

Table 1. Features employed for representing the sentences in the training and test sets.

Subsequently, we performed two sets of experiments:

– In the first set of experiments, we trained an SVM SMO classifier on the training data obtained for each language, with each of the three machine translation systems, separately (i.e. we generated a model for each of the languages considered, for each of the machine translation systems employed), using the three types of features mentioned above. Subsequently, we tested the models thus obtained on the corresponding test set (e.g. training on the Spanish training set obtained using Google Translate and testing on the Spanish test set obtained using Google Translate) and on the Gold Standard for the corresponding language (e.g. training on the Spanish training set obtained using Google Translate and testing on the Spanish Gold Standard). Additionally, in order to study the manner in which the noise in the training data can be removed, we employed the Bagging meta-classifier [6] (with varying sizes of the bag and SMO as base classifier). In related experiments, we also employed other meta-classifiers, such as AdaBoost [1], but the best results were obtained using Bagging.
– In the second set of experiments, we combined the translated data from all three machine translation systems for the same language and created separate models based on the three types of features extracted from this data (e.g. we created a Spanish training model using the unigrams and bigrams present in the training sets generated by the translation of the training set to Spanish by Google Translate, Bing Translator and Moses). We subsequently tested the performance of the sentiment classification using the Gold Standard for the corresponding language, represented using the corresponding set of features of this model.

The results of the first set of experiments (in terms of weighted F-score, per language) are presented in Tables 2, 3, 4 and 5; the results of the second set of experiments are presented in Table 6.
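To make the experimental set-up concrete, the following minimal sketch reproduces the three feature representations, the SVM classifier and the Bagging meta-classifier with scikit-learn. It is an illustration under stated assumptions, not the original implementation (which relied on Weka's SMO and Bagging components): the file names, their "label<TAB>sentence" format, the use of a linear SVM in place of SMO and the number of bags are all assumptions made for the sake of the example.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score


def read_corpus(path):
    # One "label<TAB>sentence" pair per line (hypothetical file format).
    labels, texts = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            labels.append(label)
            texts.append(text)
    return texts, labels


# The three feature representations described above.
representations = {
    "unigram (boolean)": CountVectorizer(ngram_range=(1, 1), binary=True),
    "unigram tf-idf": TfidfVectorizer(ngram_range=(1, 1)),
    "unigram+bigram (boolean)": CountVectorizer(ngram_range=(1, 2), binary=True),
}

# A linear SVM stands in for SVM SMO; Bagging over it acts as the meta-classifier.
classifiers = {
    "SMO (linear SVM)": LinearSVC(),
    "Bagging(SVM)": BaggingClassifier(LinearSVC(), n_estimators=10),
}

# Hypothetical file names: one translated training set and the corresponding Gold Standard.
X_train, y_train = read_corpus("train.es.google.tsv")
X_test, y_test = read_corpus("gold_standard.es.tsv")

for feat_name, vectorizer in representations.items():
    for clf_name, clf in classifiers.items():
        model = make_pipeline(vectorizer, clf)
        model.fit(X_train, y_train)
        predicted = model.predict(X_test)
        # Weighted F-measure, the evaluation measure reported in Tables 2-6.
        score = f1_score(y_test, predicted, average="weighted")
        print(f"{feat_name:26s} {clf_name:18s} weighted F1 = {score:.3f}")

The same sketch covers the second set of experiments if, for a given language, the training sentences produced by the three machine translation systems are concatenated before fitting.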
Feature Representation  Test Set  SMO    Bagging
Unigram                 GS        0.683  0.687
Unigram tf-idf          GS        0.651  0.681
Unigram+Bigram          GS        0.685  0.686

Table 2. Results obtained for English using the different representations.

7 Results and Discussion

Generally speaking, from our experiments using SVM we could see that incorrect translations imply an increase in the number of features, more sparseness and more difficulty in identifying a hyperplane that separates the positive and negative examples in the training phase. Therefore, low translation quality leads to a drop in performance, as the features extracted are not informative enough to allow the classifier to learn. For German, an agglutinative language, wrong translations also lead to an explosion of features, many of which are irrelevant for the learning process.

Feature Representation  SMT        Test Set  SMO    AdaBoost M1  Bagging  BLEU Score
Unigram                 Bing       GS        0.655  0.62         0.658
                                   Tr        0.655  0.625        0.666    0.227
Unigram                 Google T.  GS        0.64   0.622        0.655
                                   Tr        0.695  0.645        0.693    0.209
Unigram                 Moses      GS        0.649  0.641        0.675
                                   Tr        0.666  0.654        0.661    0.17
Unigram tf-idf          Bing       GS        0.627  0.628        0.64
                                   Tr        0.654  0.625        0.673    0.227
Unigram tf-idf          Google T.  GS        0.626  0.598        0.643
                                   Tr        0.667  0.627        0.693    0.209
Unigram tf-idf          Moses      GS        0.654  0.646        0.659
                                   Tr        0.664  0.66         0.673    0.17
Unigram+Bigram          Bing       GS        0.641  0.631        0.648
                                   Tr        0.658  0.636        0.662    0.227
Unigram+Bigram          Google T.  GS        0.646  0.623        0.674
                                   Tr        0.687  0.645        0.661    0.209
Unigram+Bigram          Moses      GS        0.644  0.644        0.676
                                   Tr        0.667  0.667        0.674    0.17

Table 3. Results obtained for German using the different feature representations.

From Tables 2, 3, 4 and 5, we can see that there is only a small difference between the performance of the sentiment analysis system on the English data and on the translated data. In the worst case, there is a maximum drop of 12% using SMO and 8% using Bagging. Ideally, to better measure this drop, we would have had to use gold standard training data for each language. As mentioned in Section 4, the creation of such a gold standard is a very difficult and time-consuming task; we are considering the manual translation of the training data into French, German and Spanish as future work. Nonetheless, the scenario considered here was aimed at studying the use of MT for sentiment analysis in the real-life setting in which no annotated data exists for the language on which sentiment analysis is performed.

The noise in the data comes from two sources, namely incorrect translations and features that are not appropriate. Manual inspection of the results has shown that, in the case of German, tf-idf obtains the best results because it removes irrelevant features (words that are mentioned very few times). On the other hand, for languages for which the translation quality is higher - i.e. Spanish and French in our case - we obtained better results when using a combination of unigrams and bigrams. After manually inspecting the data, we noticed that the cleaner the data, the more useful the unigram and bigram representation is, as it increases the quantity of useful features for training. This is not the case for German, where this representation increases the noise (the number of noisy features) to a higher degree.

In line with the previous consideration, Bagging, by reducing the variance of the estimated models, has a positive effect on performance, increasing the F-score compared to the same learning process and features without Bagging.
These improvements are larger for the German data, because the poor quality of its translations increases the variance in the data. For the same reason, Bagging is quite effective when unigrams and bigrams are used to represent low-quality translated data. In this work we pair Bagging with SMO, but we are interested in running experiments using weak classifiers such as Naive Bayes or neural networks.

Feature Representation  SMT        Test Set  SMO    AdaBoost M1  Bagging  BLEU Score
Unigram                 Bing       GS        0.627  0.62         0.633
                                   Tr        0.634  0.629        0.618    0.316
Unigram                 Google T.  GS        0.635  0.635        0.659
                                   Tr        0.63   0.63         0.665    0.341
Unigram                 Moses      GS        0.644  0.644        0.639
                                   Tr        0.675  0.675        0.676    0.298
Unigram tf-idf          Bing       GS        0.659  0.649        0.655
                                   Tr        0.622  0.637        0.646    0.316
Unigram tf-idf          Google T.  GS        0.652  0.652        0.673
                                   Tr        0.624  0.624        0.637    0.341
Unigram tf-idf          Moses      GS        0.646  0.646        0.66
                                   Tr        0.677  0.677        0.676    0.298
Unigram+Bigram          Bing       GS        0.656  0.658        0.646
                                   Tr        0.633  0.633        0.633    0.316
Unigram+Bigram          Google T.  GS        0.653  0.653        0.665
                                   Tr        0.636  0.667        0.665    0.341
Unigram+Bigram          Moses      GS        0.664  0.664        0.671
                                   Tr        0.649  0.649        0.663    0.298

Table 4. Results obtained for Spanish using the different feature representations.

Finally, as expected, the classification performance is much higher on test data obtained using the same translator as the training data than on the Gold Standard. This is because the same incorrect translations are repeated in both sets and therefore the learning is not affected by these mistakes.

Feature Representation  SMT        Test Set  SMO    AdaBoost M1  Bagging  BLEU Score
Unigram                 Bing       GS        0.604  0.634        0.644
                                   Tr        0.649  0.654        0.657    0.243
Unigram                 Google T.  GS        0.628  0.628        0.638
                                   Tr        0.652  0.652        0.679    0.274
Unigram                 Moses      GS        0.646  0.666        0.642
                                   Tr        0.663  0.657        0.66     0.227
Unigram tf-idf          Bing       GS        0.646  0.641        0.645
                                   Tr        0.652  0.661        0.664    0.243
Unigram tf-idf          Google T.  GS        0.635  0.635        0.645
                                   Tr        0.672  0.672        0.68     0.274
Unigram tf-idf          Moses      GS        0.656  0.635        0.653
                                   Tr        0.686  0.646        0.671    0.227
Unigram+Bigram          Bing       GS        0.644  0.645        0.664
                                   Tr        0.644  0.649        0.652    0.243
Unigram+Bigram          Google T.  GS        0.64   0.64         0.659
                                   Tr        0.652  0.652        0.678    0.274
Unigram+Bigram          Moses      GS        0.633  0.633        0.645
                                   Tr        0.666  0.666        0.674    0.227

Table 5. Results obtained for French using the different feature representations.

Looking at the results in Table 6, we can see that adding all the translated training data together makes the features in the representation more sparse and increases the noise level in the training data, with harmful effects on classification performance: each classifier loses its discriminative capability. This is not the case when using tf-idf on unigrams, where the combination of the data improves the classification, as this type of feature counteracts sparsity in the data.

At the language level, the results clearly depend on the translation performance. Only for Spanish (for which we have the highest BLEU score) is each classifier able to properly learn from the training data and assign the test samples correctly. For the other languages, the translated data are so noisy that either the classifier is not able to properly learn the correct information for the positive and negative classes, which results in the assignment of most of the test points to one class and none to the other, or there is a significant drop in performance (e.g. for French), although the classifier is still able to assign test points to both classes.
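The BLEU scores [15] reported in Tables 3-5 quantify this translation quality, presumably computed against the manually corrected Gold Standard translations of the test set. As a minimal sketch of how such a corpus-level score can be obtained, the snippet below uses the sacrebleu toolkit; the file names are hypothetical, sacrebleu reports scores on a 0-100 scale (the tables use 0-1), and the paper's exact BLEU implementation may differ.

import sacrebleu

# Hypothetical file names; one sentence per line, aligned across the two files.
with open("test.es.google.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]      # machine-translated test sentences
with open("gold_standard.es.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]      # manually corrected translations

# Corpus-level BLEU of the system output against the single Gold Standard reference.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score / 100:.3f}")            # rescaled to the 0-1 range used in the tables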
The results confirm the capability of Bagging to reduce the model variance and increase the classification performance, in particular for the unigrams plus tf-idf representation and for the Spanish language. In both cases, performance is very close (for some configurations even better) to what we obtained using each dataset independently.

              Unigrams                       Unigrams + tf-idf              Unigrams+Bigrams
Language      SMO     AdaBoost M1  Bagging   SMO     AdaBoost M1  Bagging   SMO     AdaBoost M1  Bagging
To German     0.565*  0.563        0.563*    0.658   0.64         0.665     0.565*  0.563*       0.565*
To Spanish    0.587   0.599        0.598     0.657   0.646        0.666     0.419   0.494        0.511
To French     0.609   0.575        0.578     0.626   0.634        0.635     0.25    0.255        0.23

Table 6. For each language, each classifier has been trained by merging the translated data coming from the different SMT systems, and tested using the Gold Standard. * The classifier is not able to discriminate between the positive and negative classes, and assigns most of the test points to one class and none to the other.

8 Conclusions and Future Work

The main objective of this work was to study the manner in which sentiment analysis can be performed for languages other than English by employing MT systems and supervised learning. Overall, we could see that MT systems have reached a reasonable level of maturity to produce sufficiently reliable training data for languages other than English. Additionally, for some languages, the quality of the translated data is high enough to obtain performances similar to those for the original data using supervised learning, without any subsequent meta-classification for noise reduction. Finally, even in the worst cases, when the quality of the translated data is not very high, the drop in performance is at most 12% and can be reduced using meta-classifiers. From the different feature representations, we could see that wrong translations lead to a large number of features, sparseness and noise in the data points in the classification task. This is especially visible in the boolean representation, which is also more sensitive to noise. Through the different types of features and classifiers we used, we showed that using unigrams or tf-idf on unigrams as features, and/or Bagging as a meta-classifier, has a positive impact on the results. Furthermore, in the case of good translation quality, we noticed that the union of the same training data translated with various systems can help the classifiers to learn different linguistic aspects from the same data.

In future work, we plan to further study methods to improve the classification performance, both by enriching the features employed and by extending the use of meta-classifiers to enhance noise reduction. In particular, the first step will be to add specialized features corresponding to words belonging to sentiment lexica (in conjunction with the types of features we have already employed) and to include high-level syntactic information, which can reduce the impact of translation errors. Finally, we plan to employ confidence estimation mechanisms to filter the best translations, which can subsequently be employed more reliably for system training.

Acknowledgements

The authors would like to thank Ivano Azzini, from the BriLeMa Artificial Intelligence Studies, for the advice and support on using meta-classifiers. We would also like to thank the reviewers for their useful comments and suggestions on the paper.

References

1. Balahur, A. and Turchi, M. 2012. Multilingual Sentiment Analysis Using Machine Translation? Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2012), Jeju, Republic of Korea.
2. Banea, C., Mihalcea, R., and Wiebe, J. 2008. A bootstrapping method for building subjectivity lexicons for languages with scarce resources. Proceedings of the Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco.
3. Banea, C., Mihalcea, R., Wiebe, J., and Hassan, S. 2008. Multilingual subjectivity analysis using machine translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), pages 127–135, Honolulu, Hawaii.
4. Banea, C., Mihalcea, R. and Wiebe, J. 2010. Multilingual subjectivity: are more languages better? Proceedings of the International Conference on Computational Linguistics (COLING 2010), pages 28–36, Beijing, China.
5. Boudin, F., Huet, S. and Torres-Moreno, J.M. 2010. A Graph-based Approach to Cross-language Multi-document Summarization. Research Journal on Computer Science and Computer Engineering with Applications (Polibits), 43:113–118.
6. Breiman, L. 1996. Bagging predictors. Machine Learning, 24(2):123–140.
7. Brown, P. F., Della Pietra, S., Della Pietra, V. J. and Mercer, R. L. 1994. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19:263–311.
8. Deerwester, S., Dumais, S., Furnas, G. W., Landauer, T. K., and Harshman, R. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6).
9. Kim, S.-M. and Hovy, E. 2006. Automatic identification of pro and con reasons in online reviews. Proceedings of the COLING/ACL Main Conference Poster Sessions, page 483.
10. Kim, J., Li, J.-J. and Lee, J.-H. 2010. Evaluating Multilanguage-Comparability of Subjectivity Analysis Systems. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, page 595, Uppsala, Sweden, 11-16 July 2010.
11. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A. and Herbst, E. 2007. Moses: Open source toolkit for statistical machine translation. Proceedings of the Annual Meeting of the Association for Computational Linguistics, demonstration session, pages 177–180, Prague, Czech Republic.
12. Mihalcea, R., Banea, C., and Wiebe, J. 2007. Learning multilingual subjective language via cross-lingual projections. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2007), pages 976–983, Prague, Czech Republic.
13. Och, F. J. 2003. Minimum error rate training in statistical machine translation. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan.
14. Platt, J. C. 1999. Fast training of support vector machines using sequential minimal optimization. Advances in Kernel Methods, ISBN 0-262-19416-3, pages 185–208.
15. Papineni, K., Roukos, S., Ward, T. and Zhu, W. J. 2001. BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania.
16. Pang, B. and Lee, L. 2008. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, vol. 1, nr. 1–2, 2008.
17. Savoy, J. and Dolamic, L. 2009. How effective is Google's translation service in search? Communications of the ACM, 52(10):139–143.
18. Steinberger, J., Lenkova, P., Ebrahim, M., Ehrman, M., Hurriyetoglu, A., Kabadjov, M., Steinberger, R., Tanev, H., Zavarella, V. and Vazquez, S. 2011. Creating Sentiment Dictionaries via Triangulation. Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, Portland, Oregon.
19. Steinberger, J., Lenkova, P., Kabadjov, M., Steinberger, R. and van der Goot, E. 2011. Multilingual Entity-Centered Sentiment Analysis Evaluated by Parallel Corpora. Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP), Hissar, Bulgaria.
20. Turchi, M., Atkinson, M., Wilcox, A., Crawley, B., Bucci, S., Steinberger, R. and Van der Goot, E. 2012. ONTS: "Optima" News Translation System. Proceedings of EACL 2012, pages 25–31.
21. Wan, X., Li, H. and Xiao, J. 2010. Cross-language document summarization based on machine translation quality prediction. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 917–926.
22. Wilson, T., Wiebe, J., and Hoffmann, P. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. Proceedings of HLT-EMNLP 2005, pages 347–354, Vancouver, Canada.