=Paper=
{{Paper
|id=Vol-2769/36
|storemode=property
|title=Polarity Imbalance in Lexicon-based Sentiment Analysis
|pdfUrl=https://ceur-ws.org/Vol-2769/paper_36.pdf
|volume=Vol-2769
|authors=Marco Vassallo,Giuliano Gabrieli,Valerio Basile,Cristina Bosco
|dblpUrl=https://dblp.org/rec/conf/clic-it/VassalloGBB20
}}
==Polarity Imbalance in Lexicon-based Sentiment Analysis==
Marco Vassallo (1), Giuliano Gabrieli (1), Valerio Basile (2), Cristina Bosco (2)
1. CREA Research Centre for Agricultural Policies and Bio-economy, Italy
2. Dipartimento di Informatica, Università degli Studi di Torino, Italy
{marco.vassallo|giuliano.gabrieli}@crea.gov.it, {valerio.basile|cristina.bosco}@unito.it
Abstract

Polarity imbalance is an asymmetric situation that occurs while using parametric threshold values in lexicon-based Sentiment Analysis (SA). The variation across the thresholds may have an opposite impact on the prediction of negative and positive polarity. We hypothesize that this may be due to asymmetries in the data or in the lexicon, or both. We therefore carry out experiments to evaluate the effect of the lexicon and of the topics addressed in the data. Our experiments are based on a weighted version of the Italian linguistic resource MAL (Morphologically-inflected Affective Lexicon), using as weighting corpus TWITA, a large-scale corpus of messages from Twitter in Italian. The novel Weighted-MAL (W-MAL), presented for the first time in this paper, achieves better polarity classification results, especially for negative tweets, along with alleviating the aforementioned polarity imbalance.

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction and Motivation

Sentiment Analysis (SA) is the task of Natural Language Processing that aims at extracting opinions from natural language expressions, e.g., reviews or social media posts. The basic approaches to SA typically fall into one of two categories: dictionary-based and supervised machine learning. Methods based on a dictionary make use of affective lexicons, language resources where each word or lemma is associated with a score indicating its affective valence (e.g., polarity). In SA they are faster than supervised statistical approaches and, unless the resource is domain-specific, can be applied to multiple environments with minimal adaptation overhead. However, they only achieve good performance for identifying coarse opinion tendencies in large datasets, since they cannot take into account the impact of the context on the polarity value associated with a word.

Supervised statistical methods, on the other hand, tend to provide better quality predictions across benchmarks, due to their better ability to generalize over individual words and expressions and to learn higher-level features. These models also show a better ability to adapt to specific domains,
provided the availability of data suitable for training.

In order to access the lexical entries in an affective dictionary, lemmatization must be performed on each single word. Unfortunately, lemmatization is an error-prone process, with a potentially negative impact on the performance of downstream tasks such as SA. Vassallo et al. (2019) introduced a novel computational linguistic resource, the Morphologically-inflected Affective Lexicon (henceforth MAL), to address this issue by avoiding the lemmatization step in favor of a morphologically rich affective resource.

In the experiments we carried out on a specific text genre, namely social media, we observed that using a threshold to assign polarity classes is beneficial, and that using the MAL instead of a lemmatization step improves the SA performance overall, in particular due to a better prediction of the negative polarity. However, the variation in threshold has an opposite impact on the prediction of negative and positive tweets.

In this paper, we investigate the motivation behind this polarity imbalance. In particular, we speculate that it may be due to asymmetries in the data (e.g., different internal topics), in the lexicon (e.g., different amounts of negative and positive terms), or both, and we provide experiments to better understand this result and validate these hypotheses. We can therefore summarize our research questions as follows:

• Is the polarity imbalance due to the topic addressed?

• Is the polarity imbalance due to the lexicon (i.e., the resources we used, Sentix and MAL)?

• Is the polarity imbalance due to both?

A further contribution of the paper is a statistical method for finding the threshold for using the lexicon in SA tasks.

The paper is organized as follows. In the next section, affective lexicons and the resource MAL are discussed. In Section 3, we describe the issues related to polarity imbalance in lexicon-based approaches for SA. The fourth section is devoted to the impact of the lexicon on SA and to introducing W-MAL. Section 5 discusses how the topics addressed in the text may impact SA. The final section provides conclusive remarks and some hints about future work.

2 Affective Lexicons

SA is typically cast as a text classification task, very often approached with supervised statistical models in the NLP research community (Barbieri et al., 2016). However, there are several scenarios where dictionary-based methods are preferred, including large-scale industry-ready systems and domain-specific applications. While generally less accurate than supervised classification, dictionary-based methods tend to be robust in the classification of sentiment across different domains, faster, and more scalable.

For the Italian language, several sentiment dictionaries, or, using a more general term, affective lexicons, have been published with different levels of granularity of the annotation and availability to the public, as summarized on the website of the Italian Association of Computational Linguistics (http://www.ai-lc.it/en/affective-lexica-and-other-resources-for-italian/).

Sentix (Basile and Nissim, 2013) is one of the first affective lexicons created for the Italian language, with a first release described in (Basile and Nissim, 2013) and a second release called Sentix 2.0 (https://github.com/valeriobasile/sentixR). It provides an automatic alignment between SentiWordNet, an automatically-built polarity lexicon for English by Baccianella et al. (2010), and the Italian portion of MultiWordNet (Pianta et al., 2002). While the first version of Sentix associated two independent positive and negative polarity scores to each word, in Sentix 2.0 all the senses of each lemma have been collapsed into one entry by means of a weighted average, where the weights are proportional to sense frequencies computed on the sense-annotated corpus SemCor (Langone et al., 2004). Moreover, the positive and negative polarity scores have been combined into a single polarity score ranging from -1 (totally negative) to 1 (totally positive). Sentix 2.0 includes 41,800 different lemmas.

In order to use a lemma-based affective lexicon such as Sentix, lemmatization is a necessary step. In our previous work, we found that this intermediate step causes a considerable amount of noise, in the form of lemmatization
Table 1: A tweet with the output of the three lemmatization models, where the lemmas are alphabetically ordered and the errors marked in bold.

Original: @ANBI Nazionale Allarme idrico. Dopo il Po anche l'Adige è in crisi d'acqua https://t.co/GLTlMNqzEv di @AgriculturaIT
ISDT: acqua adigire allarme crisi d dopo idrico po (Sentix score: 0.080)
POSTWITA: acqua adigere allarme crisi di dopo idrico po (Sentix score: 0.080)
PARTUT: acquare adigere allarme crisi d dopo idrico po (Sentix score: -0.078)
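The contrast in Table 1 can be reproduced with a toy computation. This is only an illustrative sketch: the lexicon, scores, and function names below are invented, while the real Sentix 2.0 maps about 41,800 Italian lemmas to polarity scores in [-1, 1].

```python
# Toy illustration of how a single lemmatization error shifts a
# dictionary-based sentiment score (all values invented).
TOY_SENTIX = {
    "acqua": 0.1,      # water
    "allarme": -0.5,   # alarm
    "crisi": -0.6,     # crisis
    "idrico": 0.0,     # hydric
}

def sentix_score(lemmas: list[str]) -> float:
    """Average the polarity of the lemmas found in the lexicon;
    out-of-vocabulary lemmas are simply skipped."""
    hits = [TOY_SENTIX[l] for l in lemmas if l in TOY_SENTIX]
    return sum(hits) / len(hits) if hits else 0.0

correct = ["acqua", "allarme", "crisi", "idrico"]       # proper lemmas
erroneous = ["acquare", "allarme", "crisi", "idrico"]   # "acqua" mislemmatized

print(sentix_score(correct))    # the mildly positive "acqua" softens the score
print(sentix_score(erroneous))  # the error drops it, so the text looks more negative
```

As in the PARTUT row of the table, a mislemmatized token falls out of the lexicon and the aggregate score drifts, here towards the negative end.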
errors such as the ones shown in Table 1 (Vassallo et al., 2019). We therefore built a new resource on top of Sentix, described in the next section.

2.1 MAL

We proposed the Morphologically-inflected Affective Lexicon (MAL) in Vassallo et al. (2019). It is an extension of Sentix where the entries associated with polarity scores are not lemmas but the inflected forms related to each lemma, and the polarity score of each form is drawn from the original lemma in Sentix. The approach consists in linking the lexical items found in tweets with the entries of Sentix 2.0, without the application of an explicit lemmatization step. The lexicon is expanded by considering all the acceptable forms of its lemmas, extracted from the Morph-it! collection of Italian forms (Zanchetta and Baroni, 2005). Each form takes the same polarity score as the original lemma, but when different lemmas can assume the same form, the arithmetic mean of their polarity scores is assigned. The MAL comprises 148,867 forms, covering all the items linked to the lemmas of Sentix 2.0.

Using the MAL, we performed a series of experiments on the impact of lemmatization on dictionary-based SA, which showed how the reduction in lemmatization errors leads to better polarity classification performance.

3 Polarity Imbalance in Lexicon-based Sentiment Analysis

When using an affective lexicon to predict the polarity of natural language sentences, a threshold must be fixed to translate the numerical scores into discrete classes, e.g., positive, neutral, and negative. In Vassallo et al. (2019), we showed how the variation of such a threshold has different, opposite impacts on the accuracy of the classification, using as a benchmark the corpus annotated with sentiment polarity made available by the SENTIment POLarity Classification (SENTIPOLC) shared task at EVALITA 2016. More precisely, the red dotted lines with label ALL in Figure 1 show that the F1 score of the classification of positive polarity instances increases with stricter thresholds, while the F1 score of negative polarity instances decreases.

We postulate two non-mutually exclusive hypotheses on the origin of the polarity imbalance, namely the effect of the lexicon and the effect of the topic. The affective scores in the lexicon may be biased towards one end of the polarity spectrum due to a number of causes, resulting in skewed classification results. On the other hand, some topics tend to attract opinions more polarized towards one end of the spectrum than the other (e.g., "war" is an inherently negative topic), therefore the classification might be influenced by this intrinsic polarization.

4 The Effect of Lexicon on SA

In order to shed some light on the polarity imbalance due to the lexicon, we applied a weighted approach to MAL by developing the Weighted Morphologically-inflected Affective Lexicon (W-MAL). It originates from the intuition that less frequent terms should have a higher impact on the computation of the polarity of the sentence where they occur. This principle stems from the observation that more sought-after terms are often used to convey stronger opinions and feelings.

We therefore computed the relative frequency of every item in MAL by using TWITA, a large-scale corpus of messages from Twitter in the Italian language (Basile et al., 2018). TWITA is indeed large (covering over 500 million tweets from 2012 to 2018, and the collection is currently ongoing) and domain-agnostic enough to provide a sufficiently representative sample of the distribution of Italian words, although specific to one social media platform.

Despite its size, not all the terms from the MAL occur in TWITA: 57.9% of the 148,867 terms occurring in MAL were found in TWITA, due to the sparseness of particular inflected forms and to the presence of multi-word expressions in the lexicon (18,661, about 12%) that were not considered for matching the resources. For comparison, 73.36% of Sentix lemmas were found in TWITA.

Figure 1: Results of the polarity classification on SENTIPOLC. The threshold value on the X-axis is applied to transform the sum of the scores from the lexicon into a positive or negative label.

Accordingly, the scores of MAL were recalculated by weighting them with the associated word frequency in TWITA, using the Zipf scale measure (van Heuven et al., 2014). We chose this measure because it is easy to interpret and fast to compute. The Zipf scale is a logarithmic scale based on the well-known Zipf law of word frequency distribution (Zipf, 1949). The computation of the Zipf value of a term frequency in TWITA is straightforward, and essentially equals the logarithm of the absolute frequency scaled down by a multiplicative factor:

\[ \mathrm{Zipf}(i) = \log_{10}\left(\frac{f(i)}{\left(\sum_{j=1}^{N} f(j) + N\right)/10^{6}}\right) + 3 \]

where N is the number of distinct tokens in TWITA (6,644,867), f(i) is the absolute frequency of the i-th token in TWITA, and the sum of the token frequencies is \(\sum_{j=1}^{N} f(j) = 6{,}906{,}070{,}053\), therefore:

\[ \mathrm{Zipf}(i) = \log_{10}\left(\frac{f(i)}{6{,}906.07 + 6.644}\right) + 3 \]

The original Zipf scale is continuous and ranges from 1 (very low frequency) to 6 (very high frequency), or even 7 (e.g., for very frequent words like auxiliary verbs). By computing the Zipf score of the MAL terms on TWITA, we found some terms with very low frequencies, resulting in negative values because of the logarithmic function. These were re-coded with the minimum Zipf value. The resulting weights in the W-MAL range from a minimum of -5.16 to a maximum of 5.95 (the original MAL ranged from -1 to 1). Eventually, we decided to keep the terms that were not found in TWITA in the W-MAL with their original MAL score.

We initially applied the Zipf scale to the MAL polarity scores by simply multiplying the two scores, thus giving more weight to highly frequent terms. However, using the affective lexicon with this weighting scheme resulted in a decrease in its polarity classification performance. We therefore reversed the Zipf scale, weighting the original scores inversely with respect to word frequency. By doing so, we tested our speculation of giving more weight to low-frequency terms. We replicated the polarity detection experiment on SENTIPOLC. The results, shown as the green solid lines labeled ALL in Figure 1, indicate a better performance overall, and a reduced imbalance for both the positive polarity class (F1-score standard deviation across the thresholds of 0.035 with W-MAL vs 0.054 with MAL) and (especially) the negative polarity class (F1-score standard deviation across the thresholds of 0.008 with W-MAL vs 0.042 with MAL).
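The Zipf computation above can be sketched in a few lines. The corpus statistics are the ones reported in the text; the function name is our own, and the example frequencies are invented for illustration.

```python
import math

# Corpus statistics reported for TWITA: total token occurrences
# (sum of absolute frequencies) and number of distinct tokens.
TOTAL_TOKENS = 6_906_070_053
DISTINCT_TOKENS = 6_644_867

def zipf_score(freq: int,
               total_tokens: int = TOTAL_TOKENS,
               distinct_tokens: int = DISTINCT_TOKENS) -> float:
    """Zipf scale of van Heuven et al. (2014): log10 of the frequency
    scaled to a per-million denominator (with the type count added),
    shifted by +3 so common words land roughly in the 1-7 range."""
    per_million = (total_tokens + distinct_tokens) / 1_000_000
    return math.log10(freq / per_million) + 3

# A very frequent word lands near the top of the scale, a rare one near
# the bottom; hapaxes go negative, as noted in the text.
print(round(zipf_score(40_000_000), 2))
print(round(zipf_score(10), 2))
print(round(zipf_score(1), 2))
```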
To further clarify the effect on the polarity scores, we show two example tweets in Figure 2. In the figure, the MAL and W-MAL scores are included for the highlighted words, along with the total polarity scores computed with both dictionaries, showing how the final judgment can change from neutral to polarized (bottom example) or switch polarity entirely (top example). In particular, in the top example the scores are associated with "confondesse" (the verb "to confuse" in the subjunctive mood) and with "diritto" ("right"), while in the bottom example the scores are associated with "Istituto" ("school") and with the periphrastic verbal form "viene taciuto" ("is silenced"). (The translation of the examples is as follows. Top example: "They would be #thegoodschool if meritocracy were not confused with 'doormatcracy': the one whereby even a right becomes a concession." Bottom example: "@steGiannini #thegoodschool In the rankings of the School there are also TFA qualified teachers with 48 months of service. Why is it silenced?", where steGiannini refers to the Italian minister for school.) This result confirms our speculation that negative polarity is expressed with more specific words than positive polarity. Psychology studies also show that more complex forms of language are used for expressing criticism rather than positive evaluations (Stewart, 2015).

Figure 2: A comparison between the scores calculated for the polarized words according to MAL and W-MAL, in two tweets from the test set.

We also notice that the F1-score on the negative polarity class is generally higher than the one on the positive polarity class. This means that, with the weighting process with the inverse coding, the negative polarity of tweets is better predicted than the positive polarity. This outcome is also substantially supported by the directly proportional version of W-MAL, which performed worse than the inverse version in terms of prediction. This trend was also observed across most of the results of the SENTIPOLC shared task, mostly based on supervised models with lexical features, further indicating that the vocabulary of negative sentiment is richer than that of positive sentiment.

5 The Effect of Topic on Sentiment Analysis

In order to investigate the interaction between the imbalance of dictionary-based polarity classification and a possible asymmetry in the data (i.e., different internal topics), we performed the classification with MAL and with W-MAL with the reversed Zipf scale on a benchmark with explicitly stated topics. The test set of SENTIPOLC is composed of 1,982 Italian tweets, organized in 496 general, i.e., domain-independent, tweets and 1,486 political tweets, obtained by filtering data with specific keywords related to Italian political figures. The results of our experiment are also included in Figure 1 with the GENERAL and POLITICAL labels.

The first observation we draw from this experiment is that the polarity imbalance is a phenomenon restricted to the topic-specific section of the dataset. This confirms the hypothesis that dictionary-based polarity classification is affected by the imbalance issue to the extent to which its topic is specific. In particular, we hypothesize that some topics (such as politics) tend to attract opinions more polarized towards one end of the spectrum (the negative one in this case), therefore inducing the observed imbalance.

The second observation is that weighting the polarity scores in the dictionary based on word frequency (W-MAL) provides better overall results.
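The thresholded, frequency-weighted classification used in these experiments can be sketched as follows. This is one plausible instantiation under stated assumptions, not the actual W-MAL pipeline: the tiny lexicon and Zipf values are invented, and we model the "reversed Zipf scale" as division of the polarity score by the Zipf value (the paper does not spell out the exact operation).

```python
# Minimal sketch of lexicon-based polarity classification with a threshold
# and inverse-frequency weighting (all lexicon entries invented).
LEXICON = {
    # form: (polarity in [-1, 1], Zipf frequency score)
    "allarme": (-0.6, 4.2),
    "crisi": (-0.7, 4.8),
    "buona": (0.5, 5.5),
    "taciuto": (-0.4, 2.1),
}

def weighted_score(form: str) -> float:
    """Inverse weighting: rarer forms (lower Zipf) get a larger magnitude."""
    polarity, zipf = LEXICON[form]
    return polarity / zipf

def classify(tokens: list[str], threshold: float = 0.05) -> str:
    """Sum the weighted scores of the known forms and discretize the
    total with a symmetric threshold around zero."""
    total = sum(weighted_score(t) for t in tokens if t in LEXICON)
    if total > threshold:
        return "positive"
    if total < -threshold:
        return "negative"
    return "neutral"

print(classify(["allarme", "crisi", "idrico"]))  # two negative forms dominate
print(classify(["buona", "scuola"]))             # a single mildly positive form
```

Varying `threshold` here is exactly the knob whose variation produces the polarity imbalance discussed in Section 3.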
In particular, the F1 scores are better in the topic-specific case, specifically due to a better prediction of the negative polarity. This result reinforces the idea that a polarized topic induces polarity imbalance, and therefore that a method to alleviate such imbalance (i.e., a weighting scheme) leads to better performance. In our view, a reason for this effect is that topic-specific messages make use of less frequent words on average.

6 Conclusion and Future Work

The weighting scheme proposed in this work is a promising solution to the polarity imbalance in dictionary-based SA. The experiments show that weighting the polarity scores with word frequencies yields a more precise prediction of the polarized tweets, with lessened bias in the thresholds for neutral scores. The novel resource presented here, W-MAL, is an attempt to better characterize the most sought-after words, which have an impact on the interaction between sentiment and topic. We believe it also represents a promising attempt to control for context dependency while using lexicon-based methods for SA.

In particular, with this resource we try to give voice to the linguistic intuition that the use of a specific form within a message might meaningfully impact the sentiment expressed in the message. For instance, referring to the top example in Figure 2, by using the subjunctive mood "confondesse" of the verb "confondere" ("to confuse"), the author adds to the meaning of the verb a sense of doubtfulness and unreality. This is reinforced by the fact that this form introduces a clause coordinated with a clause headed by a verb in the conditional mood, i.e., "sarebbe" (a form of "to be"). This form of the verb "confondere" seems especially adequate for contexts where a negative polarity is expressed, and less appropriate for other cases. The use of this specific mood for the verb therefore has a meaningful impact on the sentiment expressed. The MAL properly encodes this information, which may be lost when a lemmatization step is applied to the text and all forms are subsequently considered as bearing the same meaning, without further nuances. The W-MAL does even better: it encodes probabilistic information about how suitable a form is for expressing a particular sentiment with respect to the other forms available in a given context.

For all the aforementioned reasons, this work has drawn our attention to the necessity of weighting dictionary-based affective lexicons for SA with corpus-based word frequencies. The resource is freely available at https://github.com/valeriobasile/sentixR/blob/master/sentix/inst/extdata/W-MAL.tsv

In future work, we plan to work on more refined weighting strategies, e.g., leveraging the frequency information of word forms in addition to lemmas, and taking the topic distribution into consideration. Reducing the computation load is a challenging goal as well (see Prakash et al. (2015)). On the other hand, modern transformer-based models have reached state-of-the-art results on the task of polarity detection (Polignano et al., 2019), although they are far more expensive and time-consuming to run. We therefore plan to compare the predictions of these systems, and to study ways to integrate their respective strengths (i.e., the speed and transparency of the dictionary-based approach vs. the superior prediction capability of the deep neural models) in order to boost the overall performance.

The present work was originally conceived in the framework of the AGRItrend project led by the CREA Research Centre for Agricultural Policies and Bio-economy, aiming at collecting and analyzing social media data for opinions in the domain of public policies and agriculture. As such, we plan to study the impact of the techniques presented in this paper on that particular domain, and to observe whether the same or different patterns emerge. On a similar note, so far we have conducted experiments on data from Twitter, which facilitates access to large quantities of data but restricts the range of text styles and genres found in them.

References

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10). European Language Resources Association (ELRA).

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the Evalita 2016 SENTIment POLarity Classification Task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). CEUR-WS.org.

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107.

Valerio Basile, Mirko Lai, and Manuela Sanguinetti. 2018. Long-term Social Media Data Collection at the University of Turin. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). CEUR-WS.org.

Helen Langone, Benjamin R. Haskell, and George A. Miller. 2004. Annotating WordNet. In Proceedings of the Workshop Frontiers in Corpus Annotation at HLT-NAACL 2004, pages 63–69. Association for Computational Linguistics (ACL).

Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. 2002. MultiWordNet: developing an aligned multilingual database. In Proceedings of the First International Conference on Global WordNet, pages 293–302.

Marco Polignano, Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and Valerio Basile. 2019. ALBERTO: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019). CEUR-WS.org.

Saurabh Prakash, T. Chakravarthy, and E. Kaveri. 2015. Statistically weighted reviews to enhance sentiment classification. Karbala International Journal of Modern Science, 1:26–31.

Martyn Stewart. 2015. The language of praise and criticism in a student evaluation survey. Studies in Educational Evaluation, 45:1–9.

Walter J. B. van Heuven, Pawel Mandera, Emmanuel Keuleers, and Marc Brysbaert. 2014. SUBTLEX-UK: a new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology, 67(6):1176–1190.

Marco Vassallo, Giuliano Gabrieli, Valerio Basile, and Cristina Bosco. 2019. The tenuousness of lemmatization in lexicon-based sentiment analysis. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019). Academia University Press.

Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based morphological resource for the Italian language. Corpus Linguistics 2005, 1(1).

George Kingsley Zipf. 1949. Human Behaviour and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley.