=Paper=
{{Paper
|id=Vol-2481/paper74
|storemode=property
|title=The Tenuousness of Lemmatization in Lexicon-based Sentiment Analysis
|pdfUrl=https://ceur-ws.org/Vol-2481/paper74.pdf
|volume=Vol-2481
|authors=Marco Vassallo,Giuliano Gabrieli,Valerio Basile,Cristina Bosco
|dblpUrl=https://dblp.org/rec/conf/clic-it/VassalloGBB19
}}
==The Tenuousness of Lemmatization in Lexicon-based Sentiment Analysis==
Marco Vassallo, Giuliano Gabrieli (CREA Research Centre for Agricultural Policies and Bio-economy; {marco.vassallo, giuliano.gabrieli}@crea.gov.it)

Valerio Basile, Cristina Bosco (Department of Computer Science, University of Turin; {basile, bosco}@di.unito.it)

===Abstract===

English. Sentiment Analysis (SA) based on an affective lexicon is popular because it is straightforward to implement and robust on data from specific, narrow domains. However, the morpho-syntactic pre-processing needed to match words in the affective lexicon (lemmatization in particular) may be prone to errors. In this paper, we show how such errors have a substantial and statistically significant impact on the performance of a simple dictionary-based SA model on Italian data from Twitter. We test three pre-trained statistical models for the lemmatization of Italian based on Universal Dependencies, and we propose a simple alternative to lemmatizing the tweets that achieves better polarity classification results.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===1 Introduction===

In the last few years, a very large variety of approaches has been proposed for addressing Sentiment Analysis (SA) related tasks. In several approaches, lexical resources play a crucial role: they allow systems to move from strings of characters to the semantic knowledge found, e.g., in an affective lexicon (for an informal definition of affective lexicon see http://www.ai-lc.it/lessici-affettivi-per-litaliano/). To achieve this result and to calculate the polarity of sentiment, or of some related categories, some shallow morphological analysis has to be applied, which mostly consists in lemmatization.

When we refer to standard text, available resources and robust lemmatizers make lemmatization a practically solved issue, but the presence of misspellings, lingo and irregularities makes the application of lemmatization to user-generated content drawn from social media and micro-blogs far less straightforward.

A possible solution consists in applying supervised machine learning techniques in order to create robust lemmatization models. However, the large manually curated datasets necessary for this task are currently very rare, in particular for languages other than English. As far as Italian is concerned, a good-quality gold standard resource in Universal Dependencies that includes texts drawn from micro-blogs has been released, namely PoSTWITA-UD (Sanguinetti et al., 2018). Unfortunately, it is not nearly large enough to be of practical use in a supervised machine learning setting.

In this paper, we focus on the lemmatization of social media texts, observing and evaluating its impact on SA. The goal of this work is to address the following research questions: what is the impact of lemmatization on SA tasks? Can we classify lemmatization errors and automatically correct (a relevant portion of) them?

We start from the empirical evidence found in a corpus of tweets from the agriculture domain that initially drew our attention to this problem. After that, we present further experiments on a manually annotated dataset. Finally, we propose some hints towards a solution based on an affective lexicon of inflected forms.

===2 Datasets===

We collected two datasets of microblogs in the Italian language, in order to experiment on realistic data.

AGRITREND is a corpus of Italian posts collected from the Twitter accounts of the main institutional and media actors related to the agricultural sector during the period January-April 2019.
The data related to the first two months of the year were used for the first issue of the institutional bulletin of the CREA Research Centre for Agricultural Policies and Bio-economy (Monda et al., 2019). Institutional motivations drove the initiative of setting up this corpus: exploring the sentiment in agriculture and thus providing insights about current and emerging trends of the agricultural sector. The dataset is composed of 8,883 tweets, including 2,554 retweets (28.75% of the total).

SENTIPOLC is the corpus distributed for the SENTIment POLarity Classification task (Barbieri et al., 2016) within the context of the evaluation campaign EVALITA 2016 (http://www.evalita.it/2016). The corpus, consisting of 9,392 tweets, was created partly by querying Twitter for specific keywords and hashtags marking political topics, and partly with random tweets on any topic. Experts and crowdsourcing contributors annotated the dataset with subjectivity (binary classification: objective/subjective), polarity (4-fold multiclass classification: positive/negative/neutral/mixed) and irony (binary classification: ironic/not-ironic).

===3 Processing the AGRITREND corpus===

In this section, we describe the processing applied to AGRITREND with the goal of SA, after a pre-processing step that consisted of tokenization and filtering out hashtags, @-mentions and URLs.

====3.1 Lexicon-based Sentiment Analysis====

While most modern SA approaches are supervised (already in 2016, only one team out of 13 participated in the SENTIPOLC shared task on Italian SA with an unsupervised system), our SA approach is unsupervised and based on an affective lexicon. Given the narrow topic scope of our data of interest and the unavailability of annotated data for agriculture, the application of an unsupervised classifier allowed us to avoid domain adaptation issues. Moreover, the dictionary-based approach is more transparent, allowing us to evaluate its errors at a finer-grained lexical level.

The method is straightforward. Given a pre-processed tweet and an affective lexicon with lemmas paired to their polarity scores, we match the tokens in the tweet to their respective entries in the lexicon, and compute the sum of their values. We use Sentix (Basile and Nissim, 2013), an affective lexicon for Italian created by the alignment of SentiWordNet (Baccianella et al., 2010) and the Italian section of MultiWordNet (Pianta et al., 2002). In particular, we adopt Sentix version 2.0 (https://github.com/valeriobasile/sentixR).

====3.2 Lemmatization====

In order to match the tweets' words with a Sentix entry, we need to transform them into their base forms, i.e., lemmatize the tweets. For this purpose the UDPipe R package was used, with the function udpipe_annotate, applying all three models available for Italian: ISDT (Italian-isdt-ud-2.3-181115), POSTWITA (Italian-postwita-ud-2.3-181115) and PARTUT (Italian-partut-ud-2.3-181115). UDPipe (Straka and Straková, 2017) is an end-to-end NLP pipeline including part-of-speech tagging and syntactic parsing with Universal Dependencies.

We ran the models on AGRITREND. In order to automatically estimate the quality of the lemmatization, the produced lemmas were checked against the Hoepli dictionary, a large, general-purpose online Italian dictionary comprising over 500,000 lemmas (https://dizionari.repubblica.it/italiano.html). The results, shown in Table 2, indicate that the UDPipe models generated a substantial amount of improper Italian lemmas. Moreover, for each of the three models, between 20% and 30% of the incorrect lemmas were generated correctly by at least one of the other two models.
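For reference, the dictionary-based scoring of Section 3.1, which these lemmas feed into, can be sketched in a few lines. This is a minimal illustration, not the authors' R implementation; the toy lexicon below is invented, while the real Sentix pairs roughly 41,800 Italian lemmas with polarity scores.

```python
# Minimal sketch of lexicon-based polarity scoring (Section 3.1).
# The toy lexicon is invented for illustration only.

def tweet_polarity(lemmas, lexicon):
    """Sum the polarity of every lemma found in the affective lexicon;
    lemmas missing from the lexicon contribute nothing."""
    return sum(lexicon.get(lemma, 0.0) for lemma in lemmas)

def polarity_class(score):
    """Positive above zero, negative below zero, neutral at exactly zero."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

toy_lexicon = {"allarme": -0.3, "crisi": -0.4, "bello": 0.6, "segnale": 0.2}

lemmas = ["allarme", "idrico", "crisi", "acqua"]   # a lemmatized, pre-processed tweet
score = tweet_polarity(lemmas, toy_lexicon)
print(round(score, 3), polarity_class(score))      # -0.7 negative
```

Note that a wrongly produced lemma simply fails to match its intended entry (or matches a different one), which is exactly how the lemmatization errors discussed next distort the scores.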
Table 1 shows an example of the lemmatization according to the three models: among other errors, the named entity Adige was incorrectly lemmatized by all models.

{| class="wikitable"
|+ Table 1: A tweet from AGRITREND with the output of the three UDPipe lemmatization models; the lemmas are alphabetically ordered and the errors marked in bold.
! Model !! Lemmas !! Sentix score
|-
| Original || @ANBI Nazionale Allarme idrico. Dopo il Po anche l'Adige è in crisi d'acqua https://t.co/GLTlMNqzEv di @AgriculturaIT ||
|-
| ISDT || acqua '''adigire''' allarme crisi d dopo idrico po || 0.080
|-
| POSTWITA || acqua '''adigere''' allarme crisi di dopo idrico po || 0.080
|-
| PARTUT || '''acquare''' '''adigere''' allarme crisi d dopo idrico po || -0.078
|}

{| class="wikitable"
|+ Table 2: Number and rate of lemmas produced by the UDPipe lemmatization models and not found in the Hoepli dictionary.
! Model !! Incorrect lemmas !! %
|-
| ISDT || 19,707 || 44.5
|-
| POSTWITA || 21,444 || 48.4
|-
| PARTUT || 22,440 || 50.7
|}

====3.3 Polarity detection====

We compute the polarity of the lemmatized tweets, including wrong lemmatizations, by matching the produced lemmas in Sentix. Incorrect lemmatization, even of a single word, may cause serious distortions of the polarity scores. For instance, comparing the overall polarity scores calculated for the three models in Table 1, we can see that when PARTUT is used, a wrong lemma (a non-existing verbal form of the noun acqua, "water") is associated with the word acqua, determining the attribution of a negative rather than a positive score. This phenomenon occurs often in AGRITREND regardless of the lemmatization model applied.

Table 3 shows the percentages of negative, neutral and positive tweets based on the assigned polarity for each model. Here we consider a tweet positive when its Sentix score is greater than zero, negative when it is lower than zero, and neutral when it is exactly zero.

{| class="wikitable"
|+ Table 3: Polarity classification on AGRITREND lemmatized with different UDPipe models.
! Model !! Negative !! Neutral !! Positive
|-
| ISDT || 32.6% || 9.5% || 57.9%
|-
| POSTWITA || 32.3% || 10.2% || 57.5%
|-
| PARTUT || 33.8% || 11.1% || 55.1%
|}

At first glance, from the percentages only, we might argue that the lemmatization models, each with its own bias, classified the tweets in a similar manner. However, at this step of the analysis, we cannot say anything about statistical differences in the size and sign of the polarity scores between the models.

====3.4 Statistical significance====

If the differences between the scores were not statistically significant, incorrect lemmatization would not impact the polarity scores. Conversely, if significant differences exist, the lemmatization models generate different polarity scores, severely affected by the incorrect lemmatization. In order to verify this hypothesis, we applied the non-parametric Wilcoxon (1945) signed rank test for paired samples to the polarity scores of each pair of models. This test is commonly used to verify whether the difference between two scores from the same respondents (i.e., samples) is significant, without requiring the data to follow a known probability distribution or high precision in the measures to be tested. In our case the samples are paired, since they are composed of the same tweets with potentially different lemmas, and the scores are the polarity values of the tweets after lemmatization. As a consequence, the test evaluates whether the difference between the polarity values of the tweets is due to the sign and the magnitude of the score simultaneously. The results of the Wilcoxon test, computed with the statistical package SPSS, are presented in Table 4.

{| class="wikitable"
|+ Table 4: Wilcoxon signed rank test results between pairs of UDPipe models.
! !! ISDT vs. POSTWITA !! ISDT vs. PARTUT !! POSTWITA vs. PARTUT
|-
| Standardized test statistic || -1.317 || -6.996 || 6.208
|-
| Asymptotic Sig. (2-sided test) || 0.188 (p > 0.05) || 0.000 (p < 0.05) || 0.000 (p < 0.05)
|-
| Positive differences || 2,190 || 2,250 || 2,913
|-
| Negative differences || 2,281 || 2,824 || 2,404
|-
| Ties || 4,412 || 3,809 || 3,566
|}

The results of the Wilcoxon test are not statistically significant between ISDT and POSTWITA. The polarity obtained with the PARTUT lemmatization is significantly different from that of the other two, in line with the observation of a higher number of incorrect lemmas (51%, see Table 2). The result of this test indicates that incorrect lemmatization produces statistically significant differences between the subsequent polarity scores, confirming our hypothesis.

===4 Experiments on SENTIPOLC===

In the previous section, we analyzed the lemmatization errors produced by three UDPipe models on AGRITREND, and we observed how statistically significant the effect of failed lemmatization is on the result of dictionary-based SA. Nevertheless, since the AGRITREND corpus is not annotated for sentiment polarity, we could not say anything about the accuracy of the prediction. To bridge this gap, we repeated the experiment on SENTIPOLC, where ground truth labels (also called gold standard labels) were manually annotated, starting by running the same processing pipeline as for AGRITREND. Table 5 shows an example tweet with the corresponding polarity scores. In this dataset, the percentage of incorrect lemmas, according to the Hoepli dictionary, is generally smaller than in the AGRITREND data, but still substantial: 35% for ISDT, 41% for POSTWITA and 44% for PARTUT (see Table 2 for a comparison with the other dataset).

{| class="wikitable"
|+ Table 5: Example tweet from SENTIPOLC with the output of the three UDPipe lemmatization models. The lemmas are ordered alphabetically, since they are further processed as a bag of words.
! Model !! Lemmas !! Sentix score
|-
| Original text || Capitale Europea della Cultura che combacia con la fine consultazioni de #labuonascuola: gran bel segnale :) ||
|-
| Bag of words || bel Capitale combacia consultazioni Cultura della Europea fine gran segnale ||
|-
| ISDT || bello capitale combaciare consultazione cultura di europeo fine grande segnale || 0.8449
|-
| POSTWITA || bello capitale combaciare consultazione cultura da europeo fine grande segnale || 1.0739
|-
| PARTUT || bel capitale combacia consultazione cultura dere europeo fine grande segnale || -0.2715
|}

Comparing the predictions obtained with Sentix with the labels annotated in SENTIPOLC, we evaluate the performance of the dictionary-based approach in terms of precision, recall and F1-measure, thus simultaneously measuring the impact of the different lemmatization models on the prediction accuracy. The results are shown in Table 6, in terms of F1-score for the positive polarity, the negative polarity and their average, following the official evaluation metrics of the SENTIPOLC task.

{| class="wikitable"
|+ Table 6: Performance of the dictionary-based SA with different lemmatization models.
! Model !! F1 (pos.) !! F1 (neg.) !! F1 (avg.)
|-
| ISDT || 0.404 || 0.535 || 0.470
|-
| POSTWITA || 0.414 || 0.540 || 0.477
|-
| PARTUT || 0.409 || 0.540 || 0.474
|}

The Wilcoxon test applied on SENTIPOLC gave very similar results to those achieved on AGRITREND, confirming the similarity of the classification obtained with ISDT and POSTWITA, while PARTUT tends to stand apart. Moreover, errors in lemmatization have a statistically significant impact on the SA on the SENTIPOLC dataset, to the same extent as on AGRITREND.

===5 Morphologically-inflected Affective Lexicon===

The analyses presented in the previous sections highlight how low coverage and errors in lemmatization have a negative impact on the performance of downstream tasks such as SA. In an attempt to mitigate this issue, we propose an alternative approach that links the lexical items found in tweets with the entries of an affective lexicon such as Sentix without an explicit lemmatization step.

We expand the lexicon by considering all the acceptable forms of its lemmas. Each form takes the same polarity score as the original lemma. When different lemmas can assume the same form, we assign it the arithmetic mean of the lemmas' polarity scores. We use the morph-it morphological resource for Italian (Zanchetta and Baroni, 2005) to extract all possible forms from the lemmas of Sentix 2.0, and create a Morphologically-inflected Affective Lexicon (MAL) of Italian. The MAL comprises 148,867 forms, more than three times the size of Sentix 2.0 (41,800 lemmas).

The classification performance obtained using the MAL instead of a lemmatization model is in line with the results of the experiment in Table 6: 0.408 F1 (positive), 0.542 F1 (negative) and 0.475 F1 (average).
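The expansion from a lemma-based lexicon to a form-based one can be sketched as follows. This is a toy illustration: the scores and (form, lemma) pairs are invented stand-ins for Sentix 2.0 and morph-it, and the collision on ''acqua'' is hypothetical. Forms derived from a single lemma keep its score; forms shared by several lemmas receive the arithmetic mean.

```python
from collections import defaultdict

# Sketch of the MAL construction (Section 5): each inflected form inherits
# the polarity score of its lemma; a form shared by several lemmas gets the
# arithmetic mean of their scores. Both inputs are invented toy data.

sentix_like = {"bello": 0.6, "acqua": 0.1, "acquoso": -0.2}   # lemma -> polarity
forms_like = [                                                 # (form, lemma) pairs
    ("bel", "bello"), ("bella", "bello"), ("belle", "bello"),
    ("acqua", "acqua"), ("acque", "acqua"),
    ("acqua", "acquoso"),   # hypothetical collision: one form, two lemmas
]

def build_mal(lexicon, form_lemma_pairs):
    collected = defaultdict(list)
    for form, lemma in form_lemma_pairs:
        if lemma in lexicon:                 # keep only lemmas covered by the lexicon
            collected[form].append(lexicon[lemma])
    # the arithmetic mean resolves forms reachable from several lemmas
    return {form: sum(v) / len(v) for form, v in collected.items()}

mal = build_mal(sentix_like, forms_like)
print(mal["bella"])            # 0.6, inherited from the lemma "bello"
print(round(mal["acqua"], 3))  # mean of 0.1 and -0.2
```

At lookup time, tweet tokens are then matched directly against forms, so no lemmatizer is needed.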
However, so far we have employed a highly polarizing heuristic to map the Sentix score to polarity classes, that is, only tweets with an exact score of zero are classified as neutral. We therefore investigated a more conservative approach, where a parametric threshold T is introduced. After computing the polarity score of a message by summing up the polarity of its constituent words (or lemmas), we assign it a positive polarity label if the score is greater than T and a negative one if the score is lower than -T.

The results of this experiment are shown in Figure 1. Several observations can be drawn from these results. First, using a threshold to assign polarity classes is indeed beneficial, with the best threshold empirically estimated around 5. Second, using the MAL instead of a lemmatization step improves the SA performance overall, in particular thanks to a better prediction of the negative polarity. Finally, the variation of the threshold has an opposite impact on the prediction of negative and positive tweets.

Figure 1: F1-score for the positive polarity (right), negative polarity (center) and average F1 (left) of the prediction of the dictionary-based SA approach on the SENTIPOLC test set.
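The thresholded label assignment evaluated in Figure 1 amounts to the following sketch (a minimal illustration; the scores are invented, and T = 5 is the empirically estimated best value mentioned above):

```python
def polarity_label(score, t=0.0):
    """Positive above T, negative below -T, neutral in the band [-T, T].
    With T = 0 this reduces to the initial, highly polarizing heuristic."""
    if score > t:
        return "positive"
    if score < -t:
        return "negative"
    return "neutral"

scores = [7.2, 0.3, -0.1, -6.5]                  # invented message scores
print([polarity_label(s) for s in scores])       # T = 0: only exact zeros are neutral
print([polarity_label(s, t=5) for s in scores])  # T = 5: weak scores become neutral
```

Widening the neutral band trades recall on weakly polarized tweets for precision on the clearly positive and negative ones, which is consistent with the opposite trends visible in Figure 1.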
We speculate that this may be due to asymmetries in the data, in the lexicon, or both, and intend to carry out future studies to understand this result.

===6 Discussion===

Our empirical study highlights important issues arising from language analysis errors (in lemmatization, in particular) propagating down the pipeline of a simple dictionary-based SA model. Without double-checking the outcome of the lemmatization step against a dictionary, a significant amount of noise is introduced into the system, leading to unstable results. The problem is even more substantial when dealing with data in a specific domain, such as the AGRITREND dataset of tweets about agriculture, which indeed drew our attention to this problem.

We compared the POS distribution of the parsed AGRITREND and SENTIPOLC corpora with the set of UD-parsed corpora in Italian. In the Twitter data, content words are slightly more prominent, while function words are less present, although the general POS distributions have similar shapes. We report, however, an inverse correlation between the correctness of the lemmatization and the frequency of the POS, that is, words with infrequent POS tags are more likely to be wrongly lemmatized.

We also tested the performance in a setting with no lemmatization at all, and measured a relatively good performance on the SENTIPOLC benchmark with some of the parameter configurations. This is unsurprising, given our observations on the significant impact of incorrect lemmatization on the SA performance. However, such a setting is linguistically questionable (matching only an arbitrary subset of words in a lemma-based resource) and its results are highly variable.

It is also important to notice that incorrect lemmatization is likely hurtful not only to SA. The high reported number of non-existent lemmas created by the UDPipe models may severely alter the results of large-scale statistical studies on social media data, such as the ones planned by the creators of the AGRITREND data. Moreover, evaluating the correctness of a word by checking an external dictionary (in our case, Hoepli) is sensitive to potential drawbacks of that resource, e.g., leading to overestimating lemmatization errors.

In sum, when choosing a pre-processing strategy for dictionary-based SA, the need arises to strike a balance between two extremes: 1) potentially incorrect lemmatization provided by a statistical model, which possibly underestimates the polarity; 2) an inclusive approach like the MAL, which possibly overestimates the polarity.

===7 Conclusion and Future Work===

In this paper, we presented an empirical and statistical study of the impact of lemmatization on an NLP pipeline for SA based on an affective lexicon. We found that lemmatization tools need to be used carefully, in order not to introduce too much noise and deteriorate the performance downstream. We then proposed an alternative approach that skips the lemmatization step in favor of a morphologically rich affective resource, in order to alleviate some of the observed issues; the MAL is available for download at https://github.com/valeriobasile/sentixR/blob/master/sentix/inst/extdata/MAL.tsv. We plan on integrating the proposed solutions, including the MAL and an automatic check of the lemmas produced by UDPipe, in a pre-processing pipeline based on UDPipe.

===Acknowledgments===

The work of Marco Vassallo and Giuliano Gabrieli is funded by the Statistical Office of CREA. The work of Valerio Basile and Cristina Bosco is partially funded by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media, S1618 L2 BOSC 01).

===References===

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTIment POLarity Classification Task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December.

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107.

Mafalda Monda, Giuliano Gabrieli, and Marco Vassallo. 2019. Sentiment in agricoltura: il termometro dell'agricoltura - i principali temi discussi su Twitter e gli umori degli addetti. In I numeri dell'Agricoltura Italiana. CREA, Centro Politiche e Bio-economia, June.

Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. 2002. MultiWordNet: developing an aligned multilingual database. In Proceedings of the First International Conference on Global WordNet, January.

Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, Oronzo Antonelli, and Fabio Tamburini. 2018. PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).

Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada, August. Association for Computational Linguistics.

Frank Wilcoxon. 1945. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83.

Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based morphological resource for the Italian language. Corpus Linguistics 2005, 1(1).