
Neutral Score Detection in Lexicon-based Sentiment Analysis: the Quartile-based Approach

Marco Vassallo

Giuliano Gabrieli

Valerio Basile

Cristina Bosco

CREA Research Centre for Agricultural Policies and Bio-economy, Rome, Italy
Dipartimento di Informatica, University of Turin, Turin, Italy

Neutrality detection in Sentiment Analysis (SA) remains an unsolved and debated issue. This work proposes an empirical method based on the quartiles of the polarity distribution for a lexicon-based SA approach. Our experiments are based on the Italian linguistic resource MAL (Morphologically-inflected Affective Lexicon) and applied to two annotated corpora. The findings show a better detection of neutral expressions while preserving a substantial overall polarity prediction.

Keywords: Sentiment Analysis, Lexicon, Neutrality, Optimization

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
Email: marco.vassallo@crea.gov.it (M. Vassallo); giuliano.gabrieli@crea.gov.it (G. Gabrieli); valerio.basile@unito.it (V. Basile); cristina.bosco@unito.it (C. Bosco)
ORCID: 0000-0001-7016-6549 (M. Vassallo); 0000-0001-8110-6832 (V. Basile); 0000-0002-8857-4484 (C. Bosco)

1. Introduction and rationale

Sentiment Analysis (SA) is a well-studied task of Natural Language Processing (NLP), whose main objective is to classify opinions from natural language expressions as positive, neutral, negative or a mixture of those [1]. Neutrality detection in SA has been approached in different ways [2, 3, 4], but there is still little agreement on how to detect neutral expressions [4, p.136]. In this paper, we approach neutrality detection in lexicon-based SA, where an affective lexicon provides polarity scores ranging from $-n$ to $+n$, by using a descriptive statistical method based on the quartiles.

To our knowledge, this issue has not been investigated so far. We aim at drawing attention towards a better prediction of the neutral expressions. This is done by automatically finding an optimal interval of neutral scores with a control for the asymmetry of the distribution of the scores across the polarity spectrum. Traditionally, neutrality scores have been assumed to be around point 0, or within a conventionally fixed and algebraically-led interval of [−0.5; +0.5]. Conversely, it seems more reasonable to postulate that this neutral cluster should lie in a dynamic interval around the zero value. As expected, the [−0.5; +0.5] interval is indeed insufficient for capturing the neutral values, especially when the polarity scores are symmetrical around the point zero. This is because small positive or negative deviations from zero can be incorrectly classified into their respective polarity if they are neutral. Furthermore, for topics with many controversial opinions, where polarities are indeed dispersed, the misclassification of neutral expressions appears significant, as small positive and negative deviations from zero might be more frequent. As a consequence, the neutral interval also appears to be topic-oriented and thus differs from one SA task to another, as the topic could, in turn, also influence the symmetry of the distribution of scores. The linguistic counterpart to this phenomenon is that “opinions may be so different that common ground may not be found” [5].

On the other hand, especially in the case of unimodal distributions, the more asymmetrical the polarity score distribution is, the more the polarities might be positively or negatively skewed, and the less likely a false neutral classification should occur. In the case of multimodal distributions, with multiple possible polarizations, detecting the asymmetry, as well as the neutral expressions, becomes more complex. But, despite the peculiar situation with the same frequencies for oppositely polarized scores, the more a multimodal distribution is skewed (many different modes/peaks possibly far from zero), the less likely false neutral classifications should again occur.

2. The quartile-based approach

The quartiles are the values of a variable that divide its distribution into four equal parts once the data are arranged in ascending order. These values are as follows: the first quartile $Q_1$ is the value below which 25% of the data are situated; $Q_2$ is the second quartile, or median, which splits the data exactly into two halves; $Q_3$, the third quartile, is the value above which 25% of the data are situated.
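As a minimal illustration of the quartile definitions, the sketch below computes the three quartiles of a small list of polarity scores with Python's standard library. The scores are invented for the example, and the "inclusive" interpolation method is an assumption: the paper does not state which quantile convention its own script uses.

```python
from statistics import quantiles

# Hypothetical polarity scores in [-1, 1]; not taken from the paper's corpora.
scores = [-0.8, -0.3, -0.1, 0.0, 0.05, 0.2, 0.4, 0.6, 0.9]

# n=4 requests the three cut points that split the sorted data into four parts.
# quantiles() sorts the data internally, so the input order does not matter.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

print(f"Q1={q1:.3f}  Q2 (median)={q2:.3f}  Q3={q3:.3f}")
```

Other interpolation conventions can give slightly different cut points on small samples, so the convention should be fixed once and reused for every corpus being compared.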

For polarity scores from $-n$ to $+n$ (with $n \geq 1$), the neutral scores should reasonably fall into a sub-interval that belongs to $[Q_1; Q_3]$ and possibly includes the absolute zero (the neutral score by intuition). Furthermore, this sub-interval of neutral scores is, reasonably, sensitive to the topic and therefore to the asymmetry of the entire polarity distribution. Quartiles also take into account the potential asymmetry of a data distribution, since typical values of skewed data fall between $Q_1$ and $Q_3$. To understand this asymmetrical process, and thus the usefulness of the quartiles in detecting potential deviation from symmetry in a data set, we recall the Galton skewness index, also known as Bowley’s skewness index [6], which is based on the quartiles and defined as follows:

$$G = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}$$

$G$ measures the level of skewness in the dataset as the difference between the lengths of the upper quartile $(Q_3 - Q_2)$ and the lower quartile $(Q_2 - Q_1)$, normalized by the length of the interquartile range $(Q_3 - Q_1)$, i.e. a measure of the variability of the data from the median ($Q_2$). The index ranges from -1 (the distribution is negatively skewed) to +1 (the distribution is positively skewed), and it is zero for a symmetric distribution.

The logic of the optimal quartile-based interval

The main challenge now is to reveal the skewed-variant sub-interval within $[Q_1; Q_3]$ that can predict the true neutral scores without decreasing the positive and negative predictions. By searching for true neutral scores, at the same time we risk increasing false positives and negatives. This is what presumably happens whenever a default neutral interval of [−0.5; +0.5] is selected. The computational idea is straightforward and intuitive, and it makes use of annotated corpora. Once $Q_1$ and $Q_3$ of the polarity score distribution have been calculated, an R script routinizes a computational process that starts from the interval [0; 0] and moves towards $[Q_1; Q_3]$ in increasing/decreasing steps of 0.005, stopping at the sub-interval (within $[Q_1; Q_3]$) that simultaneously optimizes the F1 score for the neutral, positive and negative classes. If this simultaneous optimization yields acceptable F1-scores, the entire proposed process can be considered sufficient. In order to validate the approach and provide a tool that can be applied to unseen data, we implemented a cross-validation experiment. We randomly split each dataset into training and test sets by varying percentages of both in steps of 10%. The strategy of the dual portion-variant steps was due to the rationale of considering all potential and reasonable unseen data situations. The logic steps of the optimal quartile-based interval were then run on every split to find the optimal intervals for each pair of training and test percentages.

It is straightforward to notice that the optimal intervals of the cross-validation might not coincide with those found in the whole initial dataset. Nevertheless, they can provide a validation range to which the initial optimal intervals are the upper bound.

3. Experiments on two corpora

We considered two datasets:
• AGRITREND [7], a corpus of Italian tweets on general agricultural topics, manually annotated by three different annotators;
• SENTIPOLC, the benchmark dataset used in the SENTIment POLarity Classification shared task held at EVALITA 2016 [8], a challenge on polarity detection on Italian tweets; this is another annotated corpus of Italian tweets, including texts on three different topics: general (GEN), political (POL) and sociopolitical (SPOL).

The SENTIPOLC dataset is composed of 9,410 tweets, pre-divided into a training set (7,410 tweets) and a test set (2,000 tweets). The annotation scheme of SENTIPOLC comprises two non-mutually exclusive binary labels for positive and negative polarity. It is therefore possible for a tweet to be marked as neutral (non-positive and non-negative) or mixed (positive and negative at the same time). Two other binary labels mark the subjectivity of the message (subjective vs. objective) and the ironic content. Finally, an additional layer of annotation labels the literal positivity and negativity of the tweet, which could be different from the actual polarity (called “overall” polarity in SENTIPOLC). Note that, while this scheme is quite flexible, not all possible combinations of labels are allowed. In particular, according to a rule for the dataset, a tweet cannot be labeled at the same time as objective and as displaying sentiment polarity or irony. The origin of the tweets in SENTIPOLC is diverse, with 6,421 tweets which were part of the corpus collected for the previous edition of the shared task [9], and the rest from other smaller collections or drawn from Twitter especially for the purpose of organizing SENTIPOLC 2016. The annotation scheme of AGRITREND is exactly the same as that of SENTIPOLC by design.

For this experiment, we applied the MAL¹ (Morphologically-inflected Affective Lexicon) [7] as affective lexicon ranging from -1 to 1. It was originally derived from Sentix [12] and successively augmented with a collection of Italian forms from Morph-It [13]. Since the MAL does not classify the mixed labels, we selected the tweets with positive, negative and neutral polarities from both datasets. As a result, AGRITREND was finally composed of 1,224 tweets with 171 neutral annotated expressions, while SENTIPOLC was composed of 8,892 tweets with 3,713 neutral annotated expressions, topic-classified as follows: 1,537 for the GEN topic; 1,510 for the POL topic; 666 for the SPOL topic.

¹ The MAL was also further implemented with a weighted version named W-MAL [10], ranging from -5.16 to 5.95, that has considered the word frequencies of TWITA [11]. We also applied W-MAL in this experiment and the results were in line with those of MAL, although even more extreme. However, since the W-MAL was updated until 2020 and the datasets of AGRITREND and SENTIPOLC were respectively collected until 2022 and 2016, we prefer to present results from the unweighted version.

3.1. Results on AGRITREND

In Table 1, the quartiles and G values are reported. It can be observed that AGRITREND scores are slightly skewed positively (i.e., the G is 0.215).

Figure 1 shows the computational optimization of the quartile-based approach. Starting from the right side of the figure, this corpus has $[Q_1; Q_3]$ = [−0.125; 0.907], which corresponds to an average F1 score of 0.908 for neutral and 0.575 for positive/negative, with negative higher than positive. Setting the threshold for neutral to the default values of [−0.5; 0.5] (i.e., in correspondence of the box on top of the figure), the F1 score (on average) for neutral increases to 0.946, but the F1 score (on average) for positive/negative decreases to 0.561. Similarly, at the zero point, F1-scores are on average 0.618 and 0.748. By triggering the optimization process from [0; 0], it converges to the optimal interval of [−0.125; 0.285], where F1 scores (on average) are 0.826 for neutral and 0.626 for positive/negative. This result represents a better trade-off for a simultaneous prediction of all the labels with respect to using the default or the zero point intervals.

Tables 2–6 report the cross-validation results of the quartile-based approach (Table 2 for AGRITREND) with the training and test set steps strategy. The optimal interval initially found, [−0.125; 0.285], can be confirmed from the 90%-10% to the 80%-20% split of training and test set percentages. However, it would be possible to move until the 60%-40% split level (highlighted in bold), which was the optimal interval range that simultaneously optimized the F1 score for the neutral, positive and negative classes across the cross-validation. In this case, the upper limits increase and thus they need to be looked into. The F1-scores (on average) for the training set range from 0.626 to 0.630 and from 0.827 to 0.849 for polarized and neutral scores, respectively. The F1-scores (on average) for the test set range from 0.624 to 0.628 and from 0.827 to 0.829 for polarized and neutral scores, respectively. Table 9 presents examples of annotated tweets.

Table 1: Quartiles and G values

Corpus            Q1      Q2     Q3     G
AGRITREND        -0.125   0.280  0.907  0.215
SENTIPOLC ALL     0.099   0.656  1.315  0.084
SENTIPOLC GEN     0.000   0.533  1.160  0.081
SENTIPOLC POL     0.269   0.816  1.470  0.090
SENTIPOLC SPOL    0.060   0.589  1.193  0.066
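The G column of Table 1 can be re-derived from the three quartiles with the Galton (Bowley) index, G = [(Q3 − Q2) − (Q2 − Q1)] / (Q3 − Q1). The sketch below is only an illustration: the function name is hypothetical, and the quartile values are those recovered from Table 1, which are rounded to three decimals, so the recomputed index can differ from the published one in the last digit.

```python
def galton_skewness(q1: float, q2: float, q3: float) -> float:
    """Galton (Bowley) skewness: [(Q3 - Q2) - (Q2 - Q1)] / (Q3 - Q1)."""
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("Q1 and Q3 coincide; the index is undefined")
    return ((q3 - q2) - (q2 - q1)) / iqr


# (Q1, Q2, Q3) per corpus, as read from Table 1 (rounded values).
table1 = {
    "AGRITREND": (-0.125, 0.280, 0.907),
    "SENTIPOLC ALL": (0.099, 0.656, 1.315),
    "SENTIPOLC GEN": (0.000, 0.533, 1.160),
    "SENTIPOLC POL": (0.269, 0.816, 1.470),
    "SENTIPOLC SPOL": (0.060, 0.589, 1.193),
}

for corpus, (q1, q2, q3) in table1.items():
    print(f"{corpus:15s} G = {galton_skewness(q1, q2, q3):.3f}")
```

A value near zero, as for the SENTIPOLC rows, indicates an almost symmetric score distribution, while the larger positive value for AGRITREND reflects its slight positive skew.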

Table 2 (AGRITREND): cross-validation limits of the optimal neutral interval and average F1-scores ("Avg. all") by training-set percentage

Training (%)   Lower limit   Upper limit   Avg. all (training)   Avg. all (test)
10             -0.250        0.320         0.6157                0.6170
20             -0.135        0.225         0.6358                0.6226
30             -0.160        0.225         0.6368                0.6304
40             -0.140        0.250         0.6303                0.6337
50             -0.130        0.250         0.6286                0.6255
60             -0.125        0.320         0.6258                0.6243
70             -0.125        0.320         0.6284                0.6221
80             -0.125        0.285         0.6297                0.6237
90             -0.125        0.285         0.6299                0.6285

Test-set "Avg. all" rows from three of the remaining cross-validation tables (training percentages 10 to 90; corpus labels not preserved):

0.5679  0.5470  0.5445  0.5411  0.5435  0.5439  0.5474  0.5478  0.5489
0.5711  0.5573  0.5615  0.5658  0.5693  0.5695  0.5707  0.5693  0.5737
0.5322  0.5267  0.5203  0.5147  0.5210  0.5248  0.5309  0.5338  0.5367

The values in Table 1 show that the polarized score distribution is quite symmetrical even within each domain (i.e., the G values are all close to 0). The results on SENTIPOLC All (i.e., with no specific domain) showed an optimal interval of [0; 1.175] with 0.548 and 0.868 of F1-score (on average) for positive/negative and neutral, respectively. In comparison to the default values of the interval [−0.5; 0.5] and to the zero point, the F1-score (on average) for positive/negative also increases here (from 0.526 and 0.455 to 0.549) while preserving a high F1-score of 0.870 for the neutrals. When the polarized score distribution is close to perfect symmetry, the difference between $[Q_1; Q_3]$ and the optimal interval is minimal, which is expected because the quartiles are skew-dependent.

When the SENTIPOLC dataset is divided into specific domains, the optimal quartile-based intervals confirmed the best balance of the predictions between positive/negative and neutral scores across all domains.

Table 9: examples of tweets with their bag of words and MAL score

A. Original text: #Grow!2019: i produttori agricoli #Agrinsieme si confrontano sul #trasporto su gomma e portuale; interventi del copresidente del coordinamento @dinoscanavino e dell’Ad di #Acea
   Bag of words: produttori agricoli confrontano gomma portuale interventi copresidente coordinamento
   MAL score: -0.0061

A. Original text: Ortofrutta, analisi dei consumi durante il coronavirus-Uci-Unione Coltivatori Italiani https://t.co/UKOaone6oJ
   Bag of words: analisi consumi coronavirus unione coltivatori italiani

S. Original text: Italia progredisce se parla di innovazione, scuola digitale e alternanza scuola-lavoro #labuonascuola @cittascienza http://t.co/2pR7MVw40F
   Bag of words: Italia progredisce parla innovazione scuola digitale alternanza scuola lavoro

S. Original text: Come la tecnologia può cambiare le scuole e il sistema di apprendimento? #scuola #labuonascuola http://t.co/9bD4YsA2aG
   Bag of words: tecnologia cambiare scuole sistema apprendimento

4. Discussion

In this work, we proposed a descriptive statistical method for a better detection of the neutral expressions in lexicon-based SA with polarity scores. This method is based on quartiles and therefore on the assumption that an optimal interval for neutral scores should always take into account the potential asymmetry of the polarity distribution. This also seems in line with the linguistic speculation that the less a topic looks polarized, the more difficult it should be to detect neutral expressions. The rationale is that even small positive or negative values around the zero point could be classified as such while they should instead be neutral. Conversely, the more a topic looks polarized, the easier it should be to detect neutral expressions. In our view, an optimal interval for detecting neutral scores in lexicon-based SA should control for biases caused by the symmetry unbalance in polarity predictions.

The optimization process we presented starts with computing the first ($Q_1$) and the third ($Q_3$) quartiles of a polarity score distribution and afterwards finding the optimal interval within $[Q_1, Q_3]$ that balances the polarity and the neutral predictions simultaneously. We demonstrated that when the topic of a corpus is generic, it requires at least 60%-70% of the data as the training set to find the optimal interval of neutrals. On the other hand, the more specific the topic is, the less training data it requires to achieve a reasonable optimal interval for neutrals. We speculate that even a 30% split might be sufficient. Our results on two datasets are promising in providing a more precise prediction of neutral scores while preserving a good polarity prediction in comparison to the one obtained by the usual interval of [−0.5; +0.5] and by the single zero point.
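The optimization process summarized in this section can be sketched as a small grid search. This is an illustrative Python sketch, not the authors' R script: all function names and the toy data are hypothetical, the classification rule (scores below the lower limit are negative, above the upper limit positive) is an assumption, and the macro-averaged F1 over the three classes stands in for the "simultaneous" optimization criterion, which the paper does not define precisely.

```python
from statistics import quantiles

LABELS = ("negative", "neutral", "positive")


def classify(score, lower, upper):
    """Scores inside [lower, upper] are neutral; others are negative or positive."""
    if score < lower:
        return "negative"
    if score > upper:
        return "positive"
    return "neutral"


def f1(gold, pred, label):
    """F1 of one class; 0.0 when the class has no true positives."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def macro_f1(scores, gold, lower, upper):
    """Assumed criterion: unweighted mean of the three per-class F1 scores."""
    pred = [classify(s, lower, upper) for s in scores]
    return sum(f1(gold, pred, lab) for lab in LABELS) / len(LABELS)


def steps_from_zero(limit, step):
    """Grid 0, step, 2*step, ... moving from zero towards `limit` without passing it."""
    n = int(abs(limit) / step + 1e-9)
    sign = 1 if limit >= 0 else -1
    return [round(sign * i * step, 6) for i in range(n + 1)]


def optimal_neutral_interval(scores, gold, step=0.005):
    """Search from [0; 0] towards [Q1; Q3]; return (best macro F1, lower, upper)."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    lowers = steps_from_zero(min(q1, 0.0), step)  # stays at 0 when Q1 is positive
    uppers = steps_from_zero(max(q3, 0.0), step)
    best = (macro_f1(scores, gold, 0.0, 0.0), 0.0, 0.0)
    for lo in lowers:
        for hi in uppers:
            value = macro_f1(scores, gold, lo, hi)
            if value > best[0]:
                best = (value, lo, hi)
    return best


# Toy annotated data (invented): lexicon scores with gold polarity labels.
scores = [-0.9, -0.6, -0.1, -0.05, 0.0, 0.04, 0.1, 0.2, 0.5, 0.7, 0.9, 1.0]
gold = ["negative"] * 2 + ["neutral"] * 6 + ["positive"] * 4

best_f1, lower, upper = optimal_neutral_interval(scores, gold)
print(f"optimal neutral interval: [{lower}; {upper}], macro F1 = {best_f1:.3f}")
```

In a cross-validation setting such as the one described in the paper, the same function would be run on the training portion of each split, and the resulting interval then applied to the held-out test portion with `macro_f1`.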

5. Conclusion and future work

The asymmetry of a polarity score distribution seems to be topic-oriented, and therefore neutrality detection for lexicon-based SA with polarity scores reasonably passes through an optimal interval within the first and the third quartile, $[Q_1, Q_3]$, that takes this asymmetry into account. The findings of this work suggest that the quartile-based approach is suitable for any corpus where a task of lexicon-based SA with scores is performed. Hence, we strongly recommend further experiments on other corpora, both annotated and unannotated, and comparing/integrating this method with others (e.g. Valdivia et al. [4]) for the common objective of detecting neutral expressions. Finally, it is worth noting that our methodological framework led us to run experiments on test sets of different sizes in order to consider all potential and reasonable unseen data situations. Alternatively, one could propose a similar experiment with fixed-size test sets, which would provide more stable results, comparable even with established benchmarks, but would also significantly reduce the amount of test data.

[1] Sun, Luo, Chen, A review of natural language processing techniques for opinion mining systems, Information Fusion 36 (2017) 10-25. URL: https://www.sciencedirect.com/science/article/pii/S1566253516301117. doi:10.1016/j.inffus.2016.10.004.

[2] Koppel, J. Schler, The importance of neutral examples for learning sentiment, Computational Intelligence 22 (2006) 100-109. doi:10.1111/j.1467-8640.2006.00276.x.

[3] Pang, Lee, Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, in: K. Knight, H. T. Ng, K. Oflazer (Eds.), Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Association for Computational Linguistics, Ann Arbor, Michigan, 2005, pp. 115-124. URL: https://aclanthology.org/P05-1015. doi:10.3115/1219840.1219855.

[4] Valdivia, M. V. Luzón, Cambria, Herrera, Consensus vote models for detecting and filtering neutrality in sentiment analysis, Information Fusion 44 (2018) 126-135. URL: https://www.sciencedirect.com/science/article/pii/S1566253517306590. doi:10.1016/j.inffus.2018.03.007.

[5] Koudenburg, Kashima, A polarized discourse: Effects of opinion differentiation and structural differentiation on communication, Personality and Social Psychology Bulletin 48 (2022) 1068-1086. URL: https://doi.org/10.1177/01461672211030816. doi:10.1177/01461672211030816, PMID: 34292094.

[6] Bowley, Elements of Statistics, Studies in economics and political science, P. S. King & son, 1917. URL: https://books.google.it/books?id=M4ZDAAAAIAAJ.

[7] Vassallo, Gabrieli, Basile, Bosco, The tenuousness of lemmatization in lexicon-based sentiment analysis, in: Proceedings of the Sixth Italian Conference on Computational Linguistics - CLiC-it 2019, Academia University Press, 2019.

[8] Barbieri, Basile, Croce, Nissim, Novielli, Patti, Overview of the Evalita 2016 SENTIment POLarity Classification Task, in: Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), CEUR-WS.org, 2016.

[9] Basile, Bolioli, Nissim, Patti, Rosso, Overview of the Evalita 2014 SENTIment POLarity Classification Task, in: Proceedings of the 4th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'14), Pisa, Italy, 2014. URL: https://inria.hal.science/hal-01228925. doi:10.12871/clicit201429.

[10] Vassallo, Gabrieli, Basile, Bosco, Polarity imbalance in lexicon-based sentiment analysis, in: Proceedings of the Seventh Italian Conference on Computational Linguistics - CLiC-it 2020, 2020, pp. 457-463. doi:10.4000/books.aaccademia.8964.

[11] Basile, Lai, M. Sanguinetti, Long-term Social Media Data Collection at the University of Turin, in: Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), CEUR-WS.org, 2018.

[12] Basile, Nissim, Sentiment analysis on Italian tweets, in: Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 2013, pp. 100-107.

[13] Zanchetta, Baroni, Morph-it! A free corpus-based morphological resource for the Italian language, in: Proceedings of Corpus Linguistics 2005, 2006.