=Paper=
{{Paper
|id=Vol-2481/paper74
|storemode=property
|title=The Tenuousness of Lemmatization in Lexicon-based Sentiment Analysis
|pdfUrl=https://ceur-ws.org/Vol-2481/paper74.pdf
|volume=Vol-2481
|authors=Marco Vassallo,Giuliano Gabrieli,Valerio Basile,Cristina Bosco
|dblpUrl=https://dblp.org/rec/conf/clic-it/VassalloGBB19
}}
==The Tenuousness of Lemmatization in Lexicon-based Sentiment Analysis==
Marco Vassallo, Giuliano Gabrieli (CREA Research Centre for Agricultural Policies and Bio-economy; {marco.vassallo, giuliano.gabrieli}@crea.gov.it)

Valerio Basile, Cristina Bosco (Department of Computer Science, University of Turin; {basile, bosco}@di.unito.it)

===Abstract===

English. Sentiment Analysis (SA) based on an affective lexicon is popular because it is straightforward to implement and robust on data from specific, narrow domains. However, the morpho-syntactic pre-processing needed to match words in the affective lexicon (lemmatization in particular) may be prone to errors. In this paper, we show how such errors have a substantial and statistically significant impact on the performance of a simple dictionary-based SA model on Italian data from Twitter. We test three pre-trained statistical models for the lemmatization of Italian based on Universal Dependencies, and we propose a simple alternative to lemmatizing the tweets that achieves better polarity classification results.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===1 Introduction===

In the last few years, a very large variety of approaches has been proposed for addressing Sentiment Analysis (SA) related tasks. In several approaches, lexical resources play a crucial role: they allow systems to move from strings of characters to the semantic knowledge found, e.g., in an affective lexicon (for an informal definition of affective lexicon see http://www.ai-lc.it/lessici-affettivi-per-litaliano/). To achieve this result and to calculate the polarity of sentiment, or of some related categories, some shallow morphological analysis has to be applied, which mostly consists in lemmatization.

When we refer to standard text, available resources and robust lemmatizers make lemmatization a practically solved issue, but the presence of misspellings, lingo and irregularities makes the application of lemmatization to user-generated content drawn from social media and micro-blogs far less straightforward.

A possible solution consists in applying supervised machine learning techniques in order to create robust lemmatization models. However, the large manually curated datasets necessary for this task are currently very rare, in particular for languages other than English. As far as Italian is concerned, a good-quality gold standard resource in Universal Dependencies that includes texts drawn from micro-blogs has been released, namely PoSTWITA-UD (Sanguinetti et al., 2018). Unfortunately, it is not nearly large enough to be of practical use in a supervised machine learning setting.

In this paper, we focus on the lemmatization of social media texts, observing and evaluating its impact on SA. The goal of this work is to address the following research questions: what is the impact of lemmatization on SA tasks? Can we classify lemmatization errors and automatically correct (a relevant portion of) them?

We start from the empirical evidence found in a corpus of tweets from the agriculture domain that initially drew our attention to this problem. After that, we present further experiments on a manually annotated dataset. Finally, we propose some hints towards a solution based on an affective lexicon of inflected forms.

===2 Datasets===

We collected two datasets of microblogs in the Italian language, in order to experiment on realistic data.

AGRITREND is a corpus of Italian posts collected from the Twitter accounts of the main institutional and media actors related to the agricultural sector during the period January-April 2019.
The data related to the first two months of the year were used for the first issue of the institutional bulletin of the CREA Research Centre for Agricultural Policies and Bio-economy (Monda et al., 2019). Institutional motivations drove the initiative of setting up this corpus: exploring the sentiment in agriculture and thus providing insights about current and emerging trends of the agricultural sector. The dataset is composed of 8,883 tweets, including 2,554 retweets (28.75% of the total).

SENTIPOLC is the corpus distributed for the SENTIment POLarity Classification task (Barbieri et al., 2016) within the context of the evaluation campaign EVALITA 2016 (http://www.evalita.it/2016). The corpus, consisting of 9,392 tweets, was created partly by querying Twitter for specific keywords and hashtags marking political topics, and partly with random tweets on any topic. Experts and crowdsourcing contributors annotated the dataset with subjectivity (binary classification: objective/subjective), polarity (4-fold multiclass classification: positive/negative/neutral/mixed) and irony (binary classification: ironic/not-ironic).

===3 Processing the AGRITREND corpus===

In this section, we describe the processing applied to AGRITREND with the goal of SA, after a pre-processing step that consisted of tokenization and filtering out hashtags, @-mentions and URLs.

====3.1 Lexicon-based Sentiment Analysis====

While most modern SA approaches are supervised (already in 2016, only one team out of 13 participated in the SENTIPOLC shared task on Italian SA with an unsupervised system), our SA approach is unsupervised and based on an affective lexicon. Given the narrow topic scope of our data of interest and the unavailability of annotated data for agriculture, the application of an unsupervised classifier allowed us to avoid domain adaptation issues. Moreover, the dictionary-based approach is more transparent, allowing us to evaluate its errors at a finer-grained lexical level.

The method is straightforward. Given a pre-processed tweet and an affective lexicon with lemmas paired to their polarity scores, we match the tokens in the tweet to their respective entries in the lexicon, and compute the sum of their values. We use Sentix (Basile and Nissim, 2013), an affective lexicon for Italian created by the alignment of SentiWordNet (Baccianella et al., 2010) and the Italian section of MultiWordNet (Pianta et al., 2002). In particular, we adopt Sentix version 2.0 (https://github.com/valeriobasile/sentixR).

====3.2 Lemmatization====

In order to match the tweets' words with a Sentix entry, we need to transform them into their base forms, i.e., lemmatize the tweets. For this purpose the UDPipe R package was used, with the function udpipe_annotate, applying all three models available for Italian: ISDT (Italian-isdt-ud-2.3-181115), POSTWITA (Italian-postwita-ud-2.3-181115) and PARTUT (Italian-partut-ud-2.3-181115). UDPipe (Straka and Straková, 2017) is an end-to-end NLP pipeline including part-of-speech tagging and syntactic parsing with Universal Dependencies.

We ran the models on AGRITREND. In order to automatically estimate the quality of the lemmatization, the produced lemmas were checked against the Hoepli dictionary, a large, general-purpose online Italian dictionary comprising over 500,000 lemmas (https://dizionari.repubblica.it/italiano.html). The results, shown in Table 2, indicate that the UDPipe models generated a substantial amount of improper Italian lemmas. Moreover, for each of the three models, between 20% and 30% of the incorrect lemmas were generated correctly by at least one of the other two models.
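For reference, the dictionary-based scoring of Section 3.1, which these lemmas feed into, can be sketched in a few lines. This is a minimal illustration, not the authors' R implementation; the toy lexicon below is invented, while the real Sentix pairs roughly 41,800 Italian lemmas with polarity scores.

```python
# Minimal sketch of lexicon-based polarity scoring (Section 3.1).
# The toy lexicon is invented for illustration only.

def tweet_polarity(lemmas, lexicon):
    """Sum the polarity of every lemma found in the affective lexicon;
    lemmas missing from the lexicon contribute nothing."""
    return sum(lexicon.get(lemma, 0.0) for lemma in lemmas)

def polarity_class(score):
    """Positive above zero, negative below zero, neutral at exactly zero."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

toy_lexicon = {"allarme": -0.3, "crisi": -0.4, "bello": 0.6, "segnale": 0.2}

lemmas = ["allarme", "idrico", "crisi", "acqua"]   # a lemmatized, pre-processed tweet
score = tweet_polarity(lemmas, toy_lexicon)
print(round(score, 3), polarity_class(score))      # -0.7 negative
```

Note that a wrongly produced lemma simply fails to match its intended entry (or matches a different one), which is exactly how the lemmatization errors discussed next distort the scores.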
Table 1 shows an example of the lemmatization according to the three models: among other errors, the named entity Adige was incorrectly lemmatized by all models.

{| class="wikitable"
|+ Table 1: A tweet from AGRITREND with the output of the three UDPipe lemmatization models; the lemmas are alphabetically ordered and the errors marked in bold.
! Model !! Lemmas !! Sentix score
|-
| Original || @ANBI Nazionale Allarme idrico. Dopo il Po anche l'Adige è in crisi d'acqua https://t.co/GLTlMNqzEv di @AgriculturaIT ||
|-
| ISDT || acqua '''adigire''' allarme crisi d dopo idrico po || 0.080
|-
| POSTWITA || acqua '''adigere''' allarme crisi di dopo idrico po || 0.080
|-
| PARTUT || '''acquare''' '''adigere''' allarme crisi d dopo idrico po || -0.078
|}

{| class="wikitable"
|+ Table 2: Number and rate of lemmas produced by the UDPipe lemmatization models and not found in the Hoepli dictionary.
! Model !! Incorrect lemmas !! %
|-
| ISDT || 19,707 || 44.5
|-
| POSTWITA || 21,444 || 48.4
|-
| PARTUT || 22,440 || 50.7
|}

====3.3 Polarity detection====

We compute the polarity of the lemmatized tweets, including wrong lemmatizations, by matching the produced lemmas in Sentix. Incorrect lemmatization, even of a single word, may cause serious distortions of the polarity scores. For instance, comparing the overall polarity scores calculated for the three models in Table 1, we can see that when PARTUT is used, a wrong lemma (a non-existing verbal form of the noun acqua, "water") is associated with the word acqua, determining the attribution of a negative rather than a positive score. This phenomenon occurs often in AGRITREND regardless of the lemmatization model applied.

Table 3 shows the percentages of negative, neutral and positive tweets based on the assigned polarity for each model. Here we consider a tweet positive when its Sentix score is greater than zero, negative when it is lower than zero, and neutral when it is exactly zero.

{| class="wikitable"
|+ Table 3: Polarity classification on AGRITREND lemmatized with different UDPipe models.
! Model !! Negative !! Neutral !! Positive
|-
| ISDT || 32.6% || 9.5% || 57.9%
|-
| POSTWITA || 32.3% || 10.2% || 57.5%
|-
| PARTUT || 33.8% || 11.1% || 55.1%
|}

At first glance, from the percentages only, we might argue that the lemmatization models, each with its own bias, classified the tweets in a similar manner. However, at this step of the analysis, we cannot say anything about statistical differences in the size and sign of the polarity scores between the models.

====3.4 Statistical significance====

If the differences between the scores were not statistically significant, incorrect lemmatization would not impact the polarity scores. Conversely, if significant differences exist, the lemmatization models generate different polarity scores, severely affected by the incorrect lemmatization. In order to verify this hypothesis, we applied the non-parametric Wilcoxon (1945) signed rank test for paired samples to the polarity scores of each pair of models. This test is commonly used to verify whether the difference between two scores from the same respondents (i.e., samples) is significant, without requiring the data to follow a known probability distribution or high precision in the measures to be tested. In our case the samples are paired, since they are composed of the same tweets with potentially different lemmas, and the scores are the polarity values of the tweets after lemmatization. As a consequence, the test evaluates whether the difference between the polarity values of the tweets is due to the sign and the magnitude of the score simultaneously. The results of the Wilcoxon test, computed with the statistical package SPSS, are presented in Table 4.

{| class="wikitable"
|+ Table 4: Wilcoxon signed rank test results between pairs of UDPipe models.
! !! ISDT vs. POSTWITA !! ISDT vs. PARTUT !! POSTWITA vs. PARTUT
|-
| Standardized test statistic || -1.317 || -6.996 || 6.208
|-
| Asymptotic Sig. (2-sided test) || 0.188 (p > 0.05) || 0.000 (p < 0.05) || 0.000 (p < 0.05)
|-
| Positive differences || 2,190 || 2,250 || 2,913
|-
| Negative differences || 2,281 || 2,824 || 2,404
|-
| Ties || 4,412 || 3,809 || 3,566
|}

The results of the Wilcoxon test are not statistically significant between ISDT and POSTWITA. The polarity obtained with the PARTUT lemmatization is significantly different from that of the other two, in line with the observation of a higher number of incorrect lemmas (51%, see Table 2). The result of this test indicates that incorrect lemmatization produces statistically significant differences between the subsequent polarity scores, confirming our hypothesis.

===4 Experiments on SENTIPOLC===

In the previous section, we analyzed the lemmatization errors produced by three UDPipe models on AGRITREND, and we observed how statistically significant the effect of failed lemmatization is on the result of dictionary-based SA. Nevertheless, since the AGRITREND corpus is not annotated for sentiment polarity, we could not say anything about the accuracy of the prediction. To bridge this gap, we repeated the experiment on SENTIPOLC, where ground truth labels (also called gold standard labels) were manually annotated, starting by running the same processing pipeline as for AGRITREND. Table 5 shows an example tweet with the corresponding polarity scores. In this dataset, the percentage of incorrect lemmas, according to the Hoepli dictionary, is generally smaller than in the AGRITREND data, but still substantial: 35% for ISDT, 41% for POSTWITA and 44% for PARTUT (see Table 2 for a comparison with the other dataset).

{| class="wikitable"
|+ Table 5: Example tweet from SENTIPOLC with the output of the three UDPipe lemmatization models. The lemmas are ordered alphabetically, since they are further processed as a bag of words.
! Model !! Lemmas !! Sentix score
|-
| Original text || Capitale Europea della Cultura che combacia con la fine consultazioni de #labuonascuola: gran bel segnale :) ||
|-
| Bag of words || bel Capitale combacia consultazioni Cultura della Europea fine gran segnale ||
|-
| ISDT || bello capitale combaciare consultazione cultura di europeo fine grande segnale || 0.8449
|-
| POSTWITA || bello capitale combaciare consultazione cultura da europeo fine grande segnale || 1.0739
|-
| PARTUT || bel capitale combacia consultazione cultura dere europeo fine grande segnale || -0.2715
|}

Comparing the predictions obtained with Sentix with the labels annotated in SENTIPOLC, we evaluate the performance of the dictionary-based approach in terms of precision, recall and F1-measure, thus simultaneously measuring the impact of the different lemmatization models on the prediction accuracy. The results are shown in Table 6, in terms of F1-score for the positive polarity, the negative polarity and their average, following the official evaluation metrics of the SENTIPOLC task.

{| class="wikitable"
|+ Table 6: Performance of the dictionary-based SA with different lemmatization models.
! Model !! F1 (pos.) !! F1 (neg.) !! F1 (avg.)
|-
| ISDT || 0.404 || 0.535 || 0.470
|-
| POSTWITA || 0.414 || 0.540 || 0.477
|-
| PARTUT || 0.409 || 0.540 || 0.474
|}

The Wilcoxon test applied on SENTIPOLC gave very similar results to those achieved on AGRITREND, confirming the similarity of the classification obtained with ISDT and POSTWITA, while PARTUT tends to stand apart. Moreover, errors in lemmatization have a statistically significant impact on the SA on the SENTIPOLC dataset, to the same extent as on AGRITREND.

===5 Morphologically-inflected Affective Lexicon===

The analyses presented in the previous sections highlight how low coverage and errors in lemmatization have a negative impact on the performance of downstream tasks such as SA. In an attempt to mitigate this issue, we propose an alternative approach that links the lexical items found in tweets with the entries of an affective lexicon such as Sentix without an explicit lemmatization step.

We expand the lexicon by considering all the acceptable forms of its lemmas. Each form takes the same polarity score as the original lemma. When different lemmas can assume the same form, we assign it the arithmetic mean of the lemmas' polarity scores. We use the morph-it morphological resource for Italian (Zanchetta and Baroni, 2005) to extract all possible forms from the lemmas of Sentix 2.0, and create a Morphologically-inflected Affective Lexicon (MAL) of Italian. The MAL comprises 148,867 forms, more than three times the size of Sentix 2.0 (41,800 lemmas).

The classification performance obtained using the MAL instead of a lemmatization model is in line with the results of the experiment in Table 6: 0.408 F1 (positive), 0.542 F1 (negative) and 0.475 F1 (average).
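The expansion from a lemma-based lexicon to a form-based one can be sketched as follows. This is a toy illustration: the scores and (form, lemma) pairs are invented stand-ins for Sentix 2.0 and morph-it, and the collision on ''acqua'' is hypothetical. Forms derived from a single lemma keep its score; forms shared by several lemmas receive the arithmetic mean.

```python
from collections import defaultdict

# Sketch of the MAL construction (Section 5): each inflected form inherits
# the polarity score of its lemma; a form shared by several lemmas gets the
# arithmetic mean of their scores. Both inputs are invented toy data.

sentix_like = {"bello": 0.6, "acqua": 0.1, "acquoso": -0.2}   # lemma -> polarity
forms_like = [                                                 # (form, lemma) pairs
    ("bel", "bello"), ("bella", "bello"), ("belle", "bello"),
    ("acqua", "acqua"), ("acque", "acqua"),
    ("acqua", "acquoso"),   # hypothetical collision: one form, two lemmas
]

def build_mal(lexicon, form_lemma_pairs):
    collected = defaultdict(list)
    for form, lemma in form_lemma_pairs:
        if lemma in lexicon:                 # keep only lemmas covered by the lexicon
            collected[form].append(lexicon[lemma])
    # the arithmetic mean resolves forms reachable from several lemmas
    return {form: sum(v) / len(v) for form, v in collected.items()}

mal = build_mal(sentix_like, forms_like)
print(mal["bella"])            # 0.6, inherited from the lemma "bello"
print(round(mal["acqua"], 3))  # mean of 0.1 and -0.2
```

At lookup time, tweet tokens are then matched directly against forms, so no lemmatizer is needed.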
However, so far we have employed a highly polarizing heuristic to map the Sentix score to polarity classes, that is, only tweets with an exact score of zero are classified as neutral. We therefore investigated a more conservative approach, where a parametric threshold T is introduced. After computing the polarity score of a message by summing up the polarity of its constituent words (or lemmas), we assign it a positive polarity label if the score is greater than T and a negative one if the score is lower than -T.

The results of this experiment are shown in Figure 1. Several observations can be drawn from these results. First, using a threshold to assign polarity classes is indeed beneficial, with the best threshold empirically estimated around 5. Second, using the MAL instead of a lemmatization step improves the SA performance overall, in particular thanks to a better prediction of the negative polarity. Finally, the variation of the threshold has an opposite impact on the prediction of negative and positive tweets.

Figure 1: F1-score for the positive polarity (right), negative polarity (center) and average F1 (left) of the prediction of the dictionary-based SA approach on the SENTIPOLC test set.
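The thresholded label assignment evaluated in Figure 1 amounts to the following sketch (a minimal illustration; the scores are invented, and T = 5 is the empirically estimated best value mentioned above):

```python
def polarity_label(score, t=0.0):
    """Positive above T, negative below -T, neutral in the band [-T, T].
    With T = 0 this reduces to the initial, highly polarizing heuristic."""
    if score > t:
        return "positive"
    if score < -t:
        return "negative"
    return "neutral"

scores = [7.2, 0.3, -0.1, -6.5]                  # invented message scores
print([polarity_label(s) for s in scores])       # T = 0: only exact zeros are neutral
print([polarity_label(s, t=5) for s in scores])  # T = 5: weak scores become neutral
```

Widening the neutral band trades recall on weakly polarized tweets for precision on the clearly positive and negative ones, which is consistent with the opposite trends visible in Figure 1.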
We speculate that this may be due to asymmetries in the data, in the lexicon, or both, and intend to carry out future studies to understand this result.

===6 Discussion===

Our empirical study highlights important issues arising from language analysis errors (in lemmatization, in particular) propagating down the pipeline of a simple dictionary-based SA model. Without double-checking the outcome of the lemmatization step against a dictionary, a significant amount of noise is introduced into the system, leading to unstable results. The problem is even more substantial when dealing with data in a specific domain, such as the AGRITREND dataset of tweets about agriculture, which indeed drew our attention to this problem.

We compared the POS distribution of the parsed AGRITREND and SENTIPOLC corpora with the set of UD-parsed corpora in Italian. In the Twitter data, content words are slightly more prominent, while function words are less present, although the general POS distributions have similar shapes. We report, however, an inverse correlation between the correctness of the lemmatization and the frequency of the POS, that is, words with infrequent POS tags are more likely to be wrongly lemmatized.

We also tested the performance in a setting with no lemmatization at all, and measured a relatively good performance on the SENTIPOLC benchmark with some of the parameter configurations. This is unsurprising, given our observations on the significant impact of incorrect lemmatization on the SA performance. However, such a setting is linguistically questionable (matching only an arbitrary subset of words in a lemma-based resource) and its results are highly variable.

It is also important to notice that incorrect lemmatization is likely hurtful not only to SA. The high reported number of non-existent lemmas created by the UDPipe models may severely alter the results of large-scale statistical studies on social media data, such as the ones planned by the creators of the AGRITREND data. Moreover, evaluating the correctness of a word by checking an external dictionary (in our case, Hoepli) is sensitive to potential drawbacks of that resource, e.g., leading to overestimating lemmatization errors.

In sum, when choosing a pre-processing strategy for dictionary-based SA, the need arises to strike a balance between two extremes: 1) potentially incorrect lemmatization provided by a statistical model, which possibly underestimates the polarity; 2) an inclusive approach like the MAL, which possibly overestimates the polarity.

===7 Conclusion and Future Work===

In this paper, we presented an empirical and statistical study of the impact of lemmatization on an NLP pipeline for SA based on an affective lexicon. We found that lemmatization tools need to be used carefully, in order not to introduce too much noise and deteriorate the performance downstream. We then proposed an alternative approach that skips the lemmatization step in favor of a morphologically rich affective resource, in order to alleviate some of the observed issues; the MAL is available for download at https://github.com/valeriobasile/sentixR/blob/master/sentix/inst/extdata/MAL.tsv. We plan on integrating the proposed solutions, including the MAL and an automatic check of the lemmas produced by UDPipe, in a pre-processing pipeline based on UDPipe.

===Acknowledgments===

The work of Marco Vassallo and Giuliano Gabrieli is funded by the Statistical Office of CREA. The work of Valerio Basile and Cristina Bosco is partially funded by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media, S1618 L2 BOSC 01).

===References===

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTIment POLarity Classification Task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December.

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107.

Mafalda Monda, Giuliano Gabrieli, and Marco Vassallo. 2019. Sentiment in agricoltura: il termometro dell'agricoltura - i principali temi discussi su Twitter e gli umori degli addetti. In I numeri dell'Agricoltura Italiana. CREA, Centro Politiche e Bio-economia, June.

Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. 2002. MultiWordNet: developing an aligned multilingual database. In Proceedings of the First International Conference on Global WordNet, January.

Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, Oronzo Antonelli, and Fabio Tamburini. 2018. PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).

Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada, August. Association for Computational Linguistics.

Frank Wilcoxon. 1945. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83.

Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based morphological resource for the Italian language. Corpus Linguistics 2005, 1(1).