<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Multimodal Visual Sentiment Analysis Framework Enhanced With Feature Pyramid Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniele Galletti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Ponzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samuele Russo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Systems Analysis and Computer Science, Italian National Research Council</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Neuroimaging Laboratory, IRCCS Santa Lucia Foundation</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>55</fpage>
      <lpage>63</lpage>
      <abstract>
<p>Visual Sentiment Analysis aims to understand how images affect people in terms of evoked emotions. This paper presents a complete pipeline for comparing users' emotional responses to images, enabling the analysis of potential discrepancies between machine-inferred and subjective affective states. The proposed framework consists of three main stages. The first stage employs a Convolutional Neural Network (CNN) enhanced with Feature Pyramid Network (FPN) layers to extract multi-scale visual features. Experimental results show that incorporating three additional FPN layers improves performance while introducing only a negligible increase in model complexity. In the second stage, a multimodal approach is adopted, where visual features are integrated with textual features derived from captions generated by an Image Captioning model. This fusion enriches the emotional context by combining visual and linguistic cues. In the final stage, a grounding mechanism is applied to align and merge sentiments from the different modalities into a unified representation. The algorithm's output is then compared with the sentiment expressed by the user, enabling an analysis of the divergence between machine-inferred and human-perceived emotions.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual Sentiment Analysis</kwd>
        <kwd>Feature Pyramid Network</kwd>
        <kwd>Multimodal Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
Sentiment Analysis is a well-known field in machine learning. The goal of sentiment analysis is to measure how certain topics affect people. The outcomes of such studies are very important: having a picture of the common opinion influences political, economic, and social aspects of an entire population [<xref ref-type="bibr" rid="ref2 ref3">1, 2</xref>]. Despite its wide use on text corpora and the huge availability of data coming from social platforms, sentiment analysis is still far from achieving consistently good reliability. The lack of context and the differences between languages and cultures create, in fact, very significant barriers which make sentiment classification a difficult task. Visual Sentiment Analysis (VSA) [<xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">3, 4, 5, 6</xref>] was born as an additional instrument to understand people's sentiment. It emerged in the last decade, gaining traction with the increasing use of images to express opinions on social media platforms. Images offer an additional channel capable of expressing much more information than text [<xref ref-type="bibr" rid="ref8">7</xref>]. Images convey both semantic elements (e.g., objects, scenes) and emotional nuances, offering a richer medium than text. For this reason, social media platforms became very popular and VSA, consequently, started to grow. In this work, we present a multimodal sentiment extraction pipeline. This pipeline aims to give a framework to assess how an image is classified in terms of evoked sentiment. The pipeline is built around three main stages. In the first stage, visual features are extracted from the image using an artificial neural network. In the second stage, a neural captioning model generates a description of the image, and in the third stage, the features from the first and the second stage are mixed into a common representation. The CNN we use in the first stage is a novel architecture that integrates FPN layers into a CNN [<xref ref-type="bibr" rid="ref9">8</xref>]. This model aims to extract meaningful features at different scales, having the benefits of a CNN for object detection and also exploiting low-level features, which have proven to be useful for sentiment classification [<xref ref-type="bibr" rid="ref10">9</xref>]. The model achieves better results when compared to its predecessor [<xref ref-type="bibr" rid="ref11 ref4">3, 10</xref>] and to more classical modeling techniques [11, 12, 13, 14]. In the second step, a textual description, coming from the Image Captioning model recently presented by Wang et al. [15], is added to the features extracted in the first step. The description offers an unbiased representation unaffected by the source of the data. In the last step of the pipeline, a grounding technique is used to merge the features coming from visual and textual data. Textual features are converted into a sentiment distribution using the Emotion Sensor dataset [16]. Visual features, which live in a different domain of emotions, are similarly converted into the same representation by using an association between the labels of the two representations. This is possible since the labels in every representation used in this work are meaningful in terms of sentiment content.
      </p>
      <p>The result is then presented to the user. The user's feedback, in the form of an audio file, is converted to text using the Speech Recognition API [17], and its sentiment is extracted using the same technique as in the third step. This result is then presented to the user alongside the algorithm's result.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Datasets</title>
    </sec>
    <sec id="sec-3">
      <title>2. Related Works</title>
      <p>
        Research in Visual Sentiment Analysis has evolved significantly over the past decade, intersecting computer vision, affective computing, and multimodal learning. One of the first papers in the VSA field was presented in 2010 [18]. The authors performed positive/negative classification using SIFT features extracted from images, mixed with the textual metadata associated with each image. Text-to-sentiment conversion was done using SentiWordNet [19], which was published the same year. The SentiWordNet corpus associates synsets with sentiment polarity. In 2013, Borth et al. [20] created a visual ontology in which sentiments in an image are represented by ANPs (adjective-noun pairs). In 2014, Chen et al. [<xref ref-type="bibr" rid="ref4">3</xref>] presented DeepSentiBank, a CNN finetuned on the Flickr dataset which classifies images into a 1553-dimensional ANP vector. This vector constitutes a meaningful mid-level representation, also exploited in this work. More recently, concerning new CNN structures, Rao et al. [<xref ref-type="bibr" rid="ref12">21</xref>] used an FPN-based Faster R-CNN to extract the regions of interest (RoIs) in which sentiment is contained. Other region-based works on VSA were also presented [<xref ref-type="bibr" rid="ref13">22</xref>]. Concerning recent studies, the literature has moved towards multimodal feature extraction. In 2016, Katsurai and Satoh [<xref ref-type="bibr" rid="ref14">23</xref>] used both hand-crafted features (SIFT and GIST) and text sentiment analysis on image metadata to predict sentiment polarity. In 2018, Ortis et al. [<xref ref-type="bibr" rid="ref15">24</xref>] used multimodal classification with visual features, metadata sentiment, and objective caption extraction, with captions converted to text. Corchs et al. [<xref ref-type="bibr" rid="ref16">25</xref>] presented a method that combines visual and textual features by employing an ensemble learning approach. In particular, the authors classified emotions by combining 5 state-of-the-art classifiers trained on visual and textual data. In recent studies, artificial intelligence systems have been successfully applied in real-life environments to assess and react to emotional states, as shown in psychoeducational robotics frameworks (Ponzi et al., 2021 [<xref ref-type="bibr" rid="ref17">26</xref>]). Additionally, some recent approaches leverage eye-tracking data to infer user attention and emotional engagement with visual stimuli. These methods offer a complementary channel to multimodal sentiment analysis by correlating gaze patterns with affective responses [<xref ref-type="bibr" rid="ref18 ref19 ref20 ref21">27, 28, 29, 30</xref>].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets</title>
      <p>In this project we used three different datasets. The sentiment extraction pipeline uses these datasets at different steps.</p>
      <sec id="sec-3-1">
        <title>Flickr Dataset</title>
        <p>The first dataset used, the Flickr Dataset with CC, was created by Borth et al. [20]. Images were automatically crawled from Flickr and filtered by their metadata, resulting in 487 256 weakly annotated samples. This dataset is one of the first and most widely used datasets created for VSA tasks. Each of the 1553 classes is an Adjective-Noun Pair (ANP), a mid-level representation for sentiment classification. To build this dataset, the authors crawled Flickr images and extracted the textual tags associated with each sample. The most significant tags were then grouped and transformed into a set of adjective-noun pairs. Such a pair, called an ANP, represents a more emotionally charged concept than nouns and adjectives by themselves. Despite its wide use, this dataset presents some limitations. It is weakly annotated (categorized automatically from metadata posted by users on social networks) and thus subject to bias. The dataset is also highly unbalanced: the number of samples per class varies widely, from 23 to 1402. We used this dataset to finetune neural network models trained on object detection tasks. Further details are presented in the Implementation and Results sections.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Emotion Dataset</title>
        <p>
          The second dataset, published in 2016 [
          <xref ref-type="bibr" rid="ref22">31</xref>
          ] and available on Github, provides 23 308 images
manually annotated using the 8 basic emotions presented
by Mikels et al. [
          <xref ref-type="bibr" rid="ref23">32</xref>
          ]. The team started from more than 3 million weakly labeled images; they filtered and annotated the images by designing a task in which a group of people was asked to answer simple questions. From the results, they built the largest manually annotated dataset up to that time. As a motivation for the work, they discussed the predominance, in existing datasets, of images associated with the Fear and Sadness emotions (Figure 2). This predominance can result in unbalanced classes, which can prevent an algorithm from working correctly. The Emotion Dataset offers a better benchmark than the Flickr one, since it is more reliably labeled, less biased, and less unbalanced. An example of images grouped by Mikels emotions is shown in Figure 1. In this work, the Emotion Dataset is used to finetune the neural network models by adding a layer that maps the ANP representation (from the Flickr dataset) to the Mikels emotions.
        </p>
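        <p>A minimal sketch of such a mapping head, written as a hypothetical PyTorch reconstruction: the paper only states that an added layer maps the 1553 ANP scores to the 8 Mikels emotions, so the linear head and the stand-in backbone below are our assumptions, not the authors' code.</p>
        <preformat>import torch
import torch.nn as nn

# Hypothetical sketch of the finetuning head described above: a single
# linear layer mapping the 1553-dim ANP output to the 8 Mikels emotions.
class ANPToMikels(nn.Module):
    def __init__(self, anp_model):
        super().__init__()
        self.anp_model = anp_model      # backbone finetuned on Flickr ANPs
        self.head = nn.Linear(1553, 8)  # ANP scores -> Mikels emotion logits

    def forward(self, x):
        return self.head(self.anp_model(x))

# Stand-in backbone, only to make the sketch runnable end to end.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1553))
model = ANPToMikels(backbone)
logits = model(torch.randn(2, 3, 224, 224))  # shape: (2, 8)</preformat>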
      </sec>
      <sec id="sec-3-2">
        <title>Emotion Sensor Dataset The third dataset used is the Full Emotion Sensor dataset [16]. This dataset associates</title>
        <p>
          the most used 23 730 words coming from the internet to a
          distribution over 7 emotions. The dataset, whose preview is at the current time no longer available [16], was created by collecting thousands of sentences from blogs and online posts. The authors then labeled the sentences, both manually and automatically, with 7 emotions and used a naive Bayes classifier to assign distributions to words. The 7 emotions correspond to an extended version of the 6 Ekman basic emotions [33], with a neutral emotion added for the case of an equal distribution over the other 6. This dataset, made for NLP tasks, is used in this work to convert different representations of sentiment into a common one. The outcome of the algorithm is a distribution over these 7 emotions.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Emotion Representation</title>
      <p>There are several ways to represent a sentiment. Different psychological studies have led to different ways of representing human feelings in terms of basic emotions. In order to create and categorize data under a set of classes, both psychological studies and data analyses were performed. The most popular model in the literature is Plutchik's Wheel of Emotions [34]. This model defines 8 basic emotions with 3 valences each, resulting in 24 total classes. In this work, we used three different representations. The first, used in the Flickr dataset with CC [20], is the ANP representation. It consists of pairs of adjectives and nouns which are meaningful in terms of their emotional content. The second representation was introduced by Mikels et al. [<xref ref-type="bibr" rid="ref23">32</xref>] and used in the Emotion dataset [<xref ref-type="bibr" rid="ref22">31</xref>]. It defines 8 classes of emotions resulting from an analysis of the IAPS dataset. The third method was presented by Ekman et al. [33]. They found 6 basic emotions by categorizing the facial expressions of individuals subjected to a test which involved 10 different cultures. This representation was used in the Emotion Sensor dataset [16] with one additional neutral sentiment.</p>
      <p>In this work, we tackle the problem of having different emotion representations by using a grounding technique that transforms all representations into one. This technique assumes that there exists an association among the different sentiment spaces, since all the representations cover the same emotional content. The common representational model is chosen to be the extended Ekman representation used in the Emotion Sensor dataset. Using this dataset, we convert the other two representations into a distribution over the 7 basic emotions.</p>
      <p>The conversion between Mikels' representation and Ekman's was performed using Mikels' labels. The labels are directly mapped into a distribution by the Sensor dataset. Some labels are common to both representations; thus, the output distribution presents a strong predominance of that emotion (an example is shown in Figure 3). Some other labels can cause problems connected to their distribution. The Sensor dataset can in fact present some non-coherent distributions due to the poor quality of the data and the nature of the dataset. This is reflected in the conversion, as shown in Figure 4.</p>
      <p>The ANP representation is converted into a distribution over the 7 emotions using the same technique. Each word of the pair corresponds to one distribution; the output is the sum of the two distributions.</p>
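      <p>The two conversions can be summarized in a short sketch. The lexicon excerpt below is invented for illustration (the real Emotion Sensor dataset provides one 7-emotion distribution per word), and renormalizing the ANP sum is our assumption to keep the output a distribution; the paper only states the sum.</p>
      <preformat>import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]

# Invented excerpt of the Emotion Sensor lexicon: word -> distribution.
emotion_sensor = {
    "happy":   np.array([0.01, 0.01, 0.02, 0.90, 0.02, 0.03, 0.01]),
    "dog":     np.array([0.05, 0.03, 0.10, 0.55, 0.07, 0.10, 0.10]),
    "sadness": np.array([0.05, 0.03, 0.07, 0.02, 0.80, 0.01, 0.02]),
}

def mikels_to_ekman(label):
    """A Mikels label is itself a sentiment-bearing word: look it up directly."""
    return emotion_sensor[label]

def anp_to_ekman(adjective, noun):
    """Sum the per-word distributions of the ANP, then renormalize."""
    combined = emotion_sensor[adjective] + emotion_sensor[noun]
    return combined / combined.sum()

print(dict(zip(EMOTIONS, anp_to_ekman("happy", "dog").round(3))))</preformat>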
    </sec>
    <sec id="sec-4">
      <title>5. Models</title>
      <sec id="sec-4-1">
        <title>5.1. Visual Sentiment Extraction</title>
        <sec id="sec-4-1-1">
          <title>In this work, we use a Convolutional Neural Network</title>
        <p>In this work, we use a Convolutional Neural Network to extract visual features from an image. The proposed CNN is a modification of a popular architecture for object detection [<xref ref-type="bibr" rid="ref9">8</xref>]. We trained the architecture and tested it on the Flickr dataset, as done by Chen et al. [<xref ref-type="bibr" rid="ref4">3</xref>]. In addition, we created a new architecture by introducing 3 Feature Pyramid layers. These layers extract low-level features.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Feature Pyramid Network Model</title>
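        <p>A minimal sketch of one way to realize this design, as our reconstruction under stated assumptions: the tap points at AlexNet's three max-pool outputs, the 1x1 lateral convolutions, the 256-channel branch width, and fusion by concatenation are all assumptions, not the authors' exact architecture.</p>
        <preformat>import torch
import torch.nn as nn
from torchvision.models import alexnet

class AlexNetFPN(nn.Module):
    """AlexNet backbone with 3 FPN-style lateral branches (a sketch)."""

    def __init__(self, num_classes=1553):
        super().__init__()
        self.features = alexnet(weights=None).features  # load pretrained weights in practice
        # Assumed tap points: outputs of the three max-pool stages.
        self.taps = {2: 64, 5: 192, 12: 256}  # layer index -> channels
        self.laterals = nn.ModuleDict({
            str(i): nn.Sequential(
                nn.Conv2d(c, 256, kernel_size=1),  # 1x1 lateral convolution
                nn.AdaptiveAvgPool2d(1),           # global average pooling
            )
            for i, c in self.taps.items()
        })
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Linear(256 * 6 * 6 + 3 * 256, num_classes)

    def forward(self, x):
        pyramid = []
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.taps:  # collect multi-scale, low-level features
                pyramid.append(self.laterals[str(i)](x).flatten(1))
        x = self.avgpool(x).flatten(1)
        return self.classifier(torch.cat([x] + pyramid, dim=1))

logits = AlexNetFPN()(torch.randn(2, 3, 224, 224))  # shape: (2, 1553)</preformat>
        <p>The three lateral branches add only three 1x1 convolutions on top of the backbone, which is consistent with the negligible increase in model complexity reported in the abstract.</p>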
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Data Bias</title>
      <sec id="sec-5-1">
        <title>Data bias is a problem still present in Sentiment Analysis. It is connected to the diferent cultures, languages, and contexts in which diferent people live. Most of the</title>
        <p>
          datasets for VSA are, in fact, crawled from the internet similarities between images. 1357 faulty images were
and automatically annotated from metadata. This way found in the dataset, which in total remained with 21 951
of proceeding can disadvantage the algorithm’s perfor- samples. The Emotion Sensor dataset presented some
mance, since the images can be wrongly labeled. Train- lack of words useful in order to convert ANPs to
sentiing a model by providing more input channels has been ment distributions. These words, when converted, are
shown to be an efective way of tackling the bias problem replaced by their synonyms, provided by [39] and [40]
[
          <xref ref-type="bibr" rid="ref10">9</xref>
          ]. Despite this, there is no manually annotated dataset English dictionaries. The synonyms, manually annotated,
that provides both image and text channels. Text is in- were organized in a file. The user’s input is provided in
stead available in large, weakly labeled datasets crawled audio file format.
from the internet.
        </p>
      <p>Some works have tried to solve the bias problem by extracting objective features from the data. These features do not come from the same source as the training data: they are generated by another Machine Learning algorithm, and the final features combine the results of different joint algorithms. This approach has recently been shown to be very effective [<xref ref-type="bibr" rid="ref16">25</xref>], [<xref ref-type="bibr" rid="ref15">24</xref>]. In this work, we adopt an approach similar to the one used by Ortis et al. [<xref ref-type="bibr" rid="ref15">24</xref>]. We used an Image Captioning model to generate an objective description of the image. We then convert the caption to a sentiment distribution using the Emotion Sensor dataset [16]. The image captioning model used was recently presented by Wang et al. [15]. Once the caption is generated, relevant keywords are extracted for sentiment mapping. To filter keywords inside the phrase, we removed the English stopwords provided by the nltk corpus [36] and used the nltk POS tagger [37] and WordNet [38] to lemmatize words when a correspondence is not found in the Emotion Sensor dataset.</p>
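      <p>The keyword-filtering step could look like the following sketch. The lexicon stand-in is invented for illustration, and the resource names are assumptions (recent nltk versions may name some resources differently); this is not the authors' exact code.</p>
      <preformat>import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required nltk resources.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

# Hypothetical stand-in for the Emotion Sensor lookup table:
# word -> distribution over the 7 extended-Ekman emotions.
emotion_sensor = {"dog": [0.1, 0.0, 0.0, 0.05, 0.6, 0.05, 0.2]}

def penn_to_wordnet(tag):
    """Map Penn Treebank POS tags to WordNet POS constants."""
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def caption_keywords(caption):
    """Extract sentiment-bearing keywords from a generated caption."""
    lemmatizer = WordNetLemmatizer()
    stop = set(stopwords.words("english"))
    tokens = [t.lower() for t in nltk.word_tokenize(caption) if t.isalpha()]
    keywords = []
    for word, tag in nltk.pos_tag(tokens):
        if word in stop:
            continue
        if word in emotion_sensor:          # direct hit in the lexicon
            keywords.append(word)
        else:                               # fall back to the lemma
            lemma = lemmatizer.lemmatize(word, penn_to_wordnet(tag))
            if lemma in emotion_sensor:
                keywords.append(lemma)
    return keywords

print(caption_keywords("Two dogs are running on the beach"))  # ['dog']</preformat>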
      <p>As we will show in the Results section, the extraction of a neutral description is effective, but it is of little use without a good (and unbiased) conversion into a sentiment distribution.</p>
    </sec>
    <sec id="sec-7">
      <title>7. User's Input</title>
      <p>The user's input represents the second input to the system. The audio is converted into text using the Speech Recognition API for Python [17] and converted into a sentiment distribution using the Emotion Sensor dataset [16]. This distribution is then presented to the user along with the result from the pipeline.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Implementation Details</title>
      <p>The captioning model was used in inference mode; it was not used at training time due to speed limitations. The Flickr Dataset with CC [20] was resized before feeding the algorithm, since it was originally 60 GB large, unfeasible to use in the settings described above. The resized dimension is 9 GB. Concerning the Emotion Dataset, a filtering step was adopted since some images presented placeholders indicating their unavailability. We removed them by using a hashing comparison which measures the similarity between images: 1357 faulty images were found, leaving the dataset with 21 951 samples in total. The Emotion Sensor dataset lacked some of the words needed to convert ANPs into sentiment distributions. Such words are replaced, at conversion time, by their synonyms, provided by the English dictionaries [39] and [40]. The synonyms, manually annotated, were organized in a file. The user's input is provided in audio file format.</p>
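      <p>The placeholder-removal step could look like the following sketch. The file layout and the use of an MD5 digest (which assumes byte-identical placeholders) are our assumptions, since the paper does not specify the hash used.</p>
      <preformat>import hashlib
from pathlib import Path

def file_digest(path):
    """Digest of the raw file bytes; identical placeholders share one digest."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

# Hypothetical reference copy of the "image unavailable" placeholder.
placeholder_hash = file_digest("placeholder_unavailable.jpg")

dataset_dir = Path("emotion_dataset")  # hypothetical dataset layout
faulty = [p for p in dataset_dir.glob("**/*.jpg")
          if file_digest(p) == placeholder_hash]
for p in faulty:
    p.unlink()  # drop the faulty sample
print(f"Removed {len(faulty)} placeholder images")</preformat>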
    </sec>
    <sec id="sec-9">
      <title>9. Results</title>
      <p>The results shown here are relative to the benchmarks computed on the datasets presented above.</p>
      <p>The first result is relative to the ANP classification task using the Flickr dataset. The dataset was split before training into 3 subsets: a training, an evaluation, and a test set. Since the dataset is very unbalanced, we created the test set such that a minimum number of samples remained in the training set. In this way, classes with few samples are guaranteed to have at least a certain number of images in the training set; this minimum number was chosen to be 14. The two models involved are DeepSentiBank and the FPN model; both share the same backbone (AlexNet) pre-trained on the ImageNet task. The metrics used to evaluate the models are top-1, top-3, and top-10 accuracy, the same used in the DeepSentiBank paper [<xref ref-type="bibr" rid="ref4">3</xref>]. The training was done using a Stochastic Gradient Descent optimizer with the learning rate set to 1e-3, weight decay to 5e-4, and momentum to 0.9. The learning rate was shrunk by a factor of 10 every 20 epochs. The batch size was 16 samples. Both models were trained for 40 epochs. Table 1 shows the best performance achieved by each model. As shown in Table 1, the FPN model achieves about +1% better performance on the three metrics with respect to the DeepSentiBank model. The low-level features extracted by the FPN layers contribute additional, complementary information that improves classification performance. The second result consists of the evaluation of the FPN model and the DeepSentiBank model on the Emotion Dataset. Both models were trained with a Stochastic Gradient Descent optimizer with a learning rate of 1e-3 and a batch size of 16, and trained for 20 epochs. Model weights were initialized from the training on the Flickr dataset. The results presented in Table 9 are measured on test data. The results here confirm the previous statement about the FPN model. In this case, using a more balanced and less biased dataset, the FPN reaches almost a +3% higher F1 score than the DeepSentiBank model, confirming its potential in VSA tasks. The FPN model trained on the last layer only reaches a score comparable to the base finetuned DeepSentiBank model. This outcome is likely due to the fact that, by training the last layer only, the network cannot fully adapt its intermediate features to the new task.</p>
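      <p>The reported training configuration corresponds, for instance, to the following PyTorch setup. This is a sketch with a dummy loader and a plain AlexNet as a stand-in; the real runs used the Flickr and Emotion datasets and the models described above.</p>
      <preformat>import torch
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import alexnet

model = alexnet(num_classes=1553)  # stand-in for the DeepSentiBank/FPN models
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

# Dummy data, only to make the sketch self-contained; batch size 16 as in the paper.
loader = DataLoader(TensorDataset(torch.randn(32, 3, 224, 224),
                                  torch.randint(0, 1553, (32,))), batch_size=16)

for epoch in range(40):  # 40 epochs on the ANP task
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # learning rate shrunk by 10x every 20 epochs</preformat>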
      <sec id="sec-9-1">
        <title>9.1. Emotion Conversion Results</title>
        <p>In this section we present some examples concerning the conversion of the ANP and Mikels representations to the Ekman representation. The content of this section gives additional material which justifies the results above. Some examples of wrong ANP conversions are shown in Figure 5.</p>
        <p>As depicted in the figures, conversions may produce outlier values. We can see that the 'fluffy hair' ANP is associated with Fear as the predominant sentiment. This happens because the Emotion Sensor dataset presents 'fluffy' as a fearful word. The same happens for 'illegal war', which turns out to be a happy ANP according to the Emotion Sensor dataset. The presence of such outliers in the Emotion Sensor dataset can cause wrong sentiment classifications.</p>
        <p>The Mikels conversion is less affected by this kind of outlier, having fewer classes. The 8 classes are almost all classified in a balanced way. The only class which is clearly not classified correctly is the 'amusement' class. As shown in the Emotion Representation section, the Emotion Sensor dataset in fact associates the word 'amusement' with a distribution which is not correct.</p>
        <p>The conversion issue also affects the captioning model, but no NLP evaluation test was done in this project.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>10. Conclusions</title>
      <p>In this work we presented a pipeline that aims to provide a systematic evaluation of multimodal automated sentiment inference from visual data. The full sentiment pipeline uses textual, visual, and audio information in order to present a final result to the user. In this paper we focused our attention on the Visual Sentiment task, while leaving the other aspects to already developed algorithms.</p>
      <p>We have demonstrated the effectiveness of FPN layers in the VSA task. Thanks to these layers, the network gains an even larger advantage on the Emotion dataset [<xref ref-type="bibr" rid="ref22">31</xref>]. The FPN model has shown its improvement even though it only adds simple branches to the main backbone. The model used in this work is an older state-of-the-art network; it was chosen to allow a direct comparison with the original paper in which ANPs were introduced. Many SOTA CNNs could be used for the same task. Future works could verify the performance gain obtained by introducing FPN layers into these novel structures as well.</p>
      <p>This work attempts to address the multiple-representation problem of sentiment by using a simple conversion technique. We leveraged the Emotion Sensor dataset to extract a distribution associated with each word. The technique presented inaccuracies due to the limited reliability of the dataset used. Further development could rely on more structured datasets, so that the performance of the last step can be improved.</p>
      <p>Another promising direction for future work is Sentiment Analysis on audio data. Future developments of this kind can bring important improvements to the pipeline stages.</p>
      <p>The VSA problem is still far from being solved. Exploiting multimodality is the key to reaching further results. We have seen, though, that with the current settings, data bias and the unavailability of a unique representation can make VSA, as well as Sentiment Analysis in general, a very difficult task.</p>
    </sec>
    <sec id="sec-11">
      <title>11. Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly in order to: grammar and spelling check, paraphrase and reword. After using these tools/services, the authors reviewed and edited the content as needed and take full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          content.
          <source>advanced drone control, Drones</source>
          <volume>9</volume>
          (
          <year>2025</year>
          ). doi:
          <volume>10</volume>
          . 3390/drones9020109. [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zimatore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Serantoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Gallotta</surname>
          </string-name>
          , References L.
          <string-name>
            <surname>Guidetti</surname>
          </string-name>
          , G. Maulucci,
          <string-name>
            <surname>M. De Spirito</surname>
          </string-name>
          ,
          <article-title>Automatic detection of aerobic threshold through re-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>O.</given-names>
            <surname>Oyebode</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Orji</surname>
          </string-name>
          , Social Media and
          <article-title>Sentiment currence quantification analysis of heart rate time Analysis:</article-title>
          <source>The Nigeria Presidential Election</source>
          <year>2019</year>
          , series,
          <source>International Journal of Environmental Rein: 2019 IEEE 10th Annual Information Technology, search and Public Health</source>
          <volume>20</volume>
          (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .3390/ Electronics and Mobile Communication Conference (IEMCON),
          <year>2019</year>
          , pp.
          <fpage>0140</fpage>
          -
          <lpage>0146</lpage>
          . ISSN:
          <fpage>2644</fpage>
          -
          <lpage>3163</lpage>
          . [12] iMj.eCr.
          <year>pGh2al0lo0t3t1a</year>
          ,
          <year>9G98</year>
          ..
          <string-name>
            <surname>Zimatore</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Falcioni</surname>
          </string-name>
          , S. Migli-
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brandizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gallotta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          , L. Ioc- accio, M. Lanza,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Biino</surname>
          </string-name>
          , M. Giuriato, chi, D. Nardi,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <surname>Unsupervised pose es- M. Bellafiore</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Palma</surname>
          </string-name>
          , et al.,
          <article-title>Influence of geographtimation by means of an innovative vision trans- ical area and living setting on children's weight former</article-title>
          ,
          <source>in: Lecture Notes in Computer Science status, motor coordination, and physical activity</source>
          ,
          <source>(including subseries Lecture Notes in Artificial In- Frontiers in pediatrics 9</source>
          (
          <year>2022</year>
          )
          <article-title>794284</article-title>
          .
          <source>telligence and Lecture Notes in Bioinformatics)</source>
          , vol- [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zimatore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cavagnaro</surname>
          </string-name>
          ,
          <source>Recurrence analume 13589 LNAI</source>
          ,
          <year>2023</year>
          , p.
          <fpage>3</fpage>
          -
          <lpage>20</lpage>
          . doi:
          <volume>10</volume>
          .1007/ ysis of otoacoustic emissions,
          <source>Understanding 978-3-031-23480-4</source>
          _
          <fpage>1</fpage>
          .
          <string-name>
            <given-names>Complex</given-names>
            <surname>Systems</surname>
          </string-name>
          (
          <year>2015</year>
          )
          <fpage>253</fpage>
          -
          <lpage>278</lpage>
          . doi:
          <volume>10</volume>
          .1007/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Borth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Chang</surname>
          </string-name>
          , DeepSentiBank: Visual Sentiment Concept Classifica- [
          <volume>14</volume>
          ]
          <fpage>9M7</fpage>
          .
          <fpage>8C</fpage>
          -.
          <source>3G-a3ll1o9tt-a0</source>
          ,
          <fpage>V71</fpage>
          .
          <fpage>B5o5n</fpage>
          -a8v_
          <fpage>o8lo</fpage>
          .ntà, G. Zimatore,
          <string-name>
            <surname>S.</surname>
          </string-name>
          <article-title>Iaztion with Deep Convolutional Neural Networks, zoni</article-title>
          , L. Guidetti,
          <string-name>
            <given-names>C.</given-names>
            <surname>Baldari</surname>
          </string-name>
          ,
          <source>Efects of open (racket) Technical Report arXiv:1410.8586</source>
          , arXiv,
          <year>2014</year>
          .
          <article-title>and closed (running) skill sports practice on chilArXiv:1410.8586 [cs] type: article. dren's attentional performance</article-title>
          , Open Sports Sci-
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          , W. Guettala,
          <source>ences Journal 13</source>
          (
          <year>2020</year>
          )
          <fpage>105</fpage>
          -
          <lpage>113</lpage>
          . doi:
          <volume>10</volume>
          .2174/ C. Napoli,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <article-title>Enhancing sentiment analysis on seed-iv dataset with vision transformers: A [15] 1P8</article-title>
          .
          <year>7W5a3n9g9</year>
          ,
          <fpage>XA02</fpage>
          .
          <year>0Y1a3n0g1</year>
          ,
          <year>0R1</year>
          .0M5.en, J.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bai</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>comparative study</article-title>
          ,
          <source>in: Proceedings of the 2023 11th J</source>
          . Ma,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>OFA</given-names>
            : Unifyinternational conference on information technol- ing
            <surname>Architectures</surname>
          </string-name>
          , Tasks, and
          <article-title>Modalities Through ogy: IoT and smart</article-title>
          city,
          <year>2023</year>
          , pp.
          <fpage>238</fpage>
          -
          <lpage>246</lpage>
          .
          <article-title>a Simple Sequence-to-</article-title>
          <string-name>
            <surname>Sequence Learning</surname>
          </string-name>
          Frame-
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Randieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pollina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Puglisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , Smart work,
          <source>Technical Report arXiv:2202.03052</source>
          , arXiv, glove
          <article-title>: A cost-efective and intuitive interface for 2022</article-title>
          . ArXiv:
          <volume>2202</volume>
          .03052 [
          <article-title>cs] type: article. advanced drone control</article-title>
          ,
          <source>Drones</source>
          <volume>9</volume>
          (
          <year>2025</year>
          ). doi:10. [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bil</surname>
          </string-name>
          ,
          <source>Full Emotions Sensor Dataset Containing Top 3390/drones9020109. 23 730 English Words Classified Statistically Into 7</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Iacobelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <surname>C. Napoli,</surname>
          </string-name>
          <article-title>A machine learning Basic Emotions, 2022. based real-time application for engagement detec</article-title>
          - [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          (Uberi),
          <article-title>SpeechRecognition: Library for tion</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume performing speech recognition,
          <source>with support for 3695</source>
          ,
          <year>2023</year>
          , p.
          <fpage>75</fpage>
          -
          <lpage>84</lpage>
          .
          <article-title>several engines and APIs, online and ofline</article-title>
          .,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Puglisi</surname>
          </string-name>
          , S. Russo, [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Siersdorfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Minack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Deng</surname>
          </string-name>
          , J. Hare, AnalyzI. E. Tibermacine,
          <article-title>Exploiting robots as healthcare ing and predicting sentiment of images on the social resources for epidemics management and support web, in: Proceedings of the 18th ACM international caregivers</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , vol- conference on Multimedia, MM '10,
          <string-name>
            <surname>Association</surname>
            <given-names>for</given-names>
          </string-name>
          <source>ume 3686</source>
          ,
          <year>2024</year>
          , p.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . Computing Machinery, New York, NY, USA,
          <year>2010</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          , ImageNet pp.
          <fpage>715</fpage>
          -
          <lpage>718</lpage>
          .
          <article-title>Classification with Deep Convolutional Neural Net-</article-title>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          , SENTIWORDNET: A Pubworks,
          <source>in: Advances in Neural Information Process- licly Available Lexical Resource for Opinion Mining Systems</source>
          , volume
          <volume>25</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc., ing,
          <source>in: Proceedings of the Fifth International 2012. Conference on Language Resources and Evaluation</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wajda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>A com-</article-title>
          (
          <source>LREC'06)</source>
          ,
          <article-title>European Language Resources Associaparative study of machine learning approaches for tion (ELRA), Genoa</article-title>
          , Italy,
          <year>2006</year>
          .
          <article-title>autism detection in children from imaging data</article-title>
          , in: [20]
          <string-name>
            <given-names>D.</given-names>
            <surname>Borth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Breuel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <source>LargeCEUR Workshop Proceedings</source>
          , volume
          <volume>3398</volume>
          ,
          <year>2022</year>
          ,
          <article-title>scale visual sentiment ontology and detectors</article-title>
          using p.
          <fpage>9</fpage>
          -
          <lpage>15</lpage>
          .
          <article-title>adjective noun pairs</article-title>
          ,
          <source>in: Proceedings of the 21st</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Randieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pollina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Puglisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , Smart ACM international conference on Multimedia,
          <article-title>MM glove: A cost-efective and intuitive interface for '13, Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <year>2013</year>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>232</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , M. Xu, Multi-level [33]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ekman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. V.</given-names>
            <surname>Friesen</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. O'Sullivan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Chan</surname>
          </string-name>
          ,
          <article-title>Region-based Convolutional Neural Network for I. Diacoyanni-</article-title>
          <string-name>
            <surname>Tarlatzis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Heider</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>W. A.</given-names>
          </string-name>
          <string-name>
            <surname>Image Emotion</surname>
            <given-names>Classification</given-names>
          </string-name>
          , Neurocomputing LeCompte, T. Pitcairn,
          <string-name>
            <given-names>P. E.</given-names>
            <surname>Ricci-Bitti</surname>
          </string-name>
          ,
          <source>Universals</source>
          <volume>333</volume>
          (
          <year>2019</year>
          ).
          <article-title>and cultural diferences in the judgments of facial</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>She</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          , M.-M. Cheng, P. L. Rosin, expressions of emotion,
          <source>J Pers Soc Psychol</source>
          <volume>53</volume>
          (
          <year>1987</year>
          ) L.
          <string-name>
            <surname>Wang</surname>
          </string-name>
          ,
          <source>Visual Sentiment Prediction Based on 712-717. Automatic Discovery of Afective Regions</source>
          , IEEE [34]
          <string-name>
            <given-names>R.</given-names>
            <surname>Plutchik</surname>
          </string-name>
          , Chapter 1 - A
          <source>GENERAL PSYCHOTransactions on Multimedia</source>
          <volume>20</volume>
          (
          <year>2018</year>
          )
          <fpage>2513</fpage>
          -
          <lpage>2525</lpage>
          .
          <source>EVOLUTIONARY THEORY OF EMOTION</source>
          , in: Conference Name: IEEE Transactions
          <string-name>
            <surname>on Multime- R. Plutchik</surname>
          </string-name>
          , H. Kellerman (Eds.), Theories of Emodia. tion, Academic Press,
          <year>1980</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Katsurai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satoh</surname>
          </string-name>
          , Image sentiment analysis [35]
          <string-name>
            <surname>T.-Y. Lin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dollár</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>Harihausing latent correlations among visual, textual, and ran, S. Belongie, Feature Pyramid Networks for Obsentiment views</article-title>
          ,
          <year>2016</year>
          . ject Detection,
          <source>Technical Report arXiv:1612</source>
          .03144,
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , Remote eye movement arXiv,
          <year>2017</year>
          . ArXiv:
          <volume>1612</volume>
          .03144 [
          <article-title>cs] type: article. desensitization and reprocessing treatment of long</article-title>
          - [36] NLTK ::
          <source>Natural Language Toolkit</source>
          ,
          <year>2022</year>
          . covid- and
          <string-name>
            <surname>post-</surname>
            covid-related traumatic disorders: [37]
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Loria</surname>
          </string-name>
          , textblob-aptagger,
          <year>2022</year>
          .
          <article-title>Original-date: An innovative approach</article-title>
          ,
          <source>Brain Sciences</source>
          <volume>14</volume>
          (
          <year>2024</year>
          ).
          <year>2013</year>
          -
          <volume>09</volume>
          -18T20:
          <fpage>03</fpage>
          :40Z. doi:
          <volume>10</volume>
          .3390/brainsci14121212. [38]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>WordNet: a lexical database for English,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Corchs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gasparini</surname>
          </string-name>
          , Ensemble learn-
          <source>Commun. ACM</source>
          <volume>38</volume>
          (
          <year>1995</year>
          )
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          .
          <article-title>ing on visual and textual data for social image emo- [39] Oxford Learner's Dictionaries | Find definitions, tion classification</article-title>
          ,
          <source>Int. J. Mach. Learn. &amp; Cyber</source>
          . 10 translations, and grammar explanations at Oxford (
          <year>2019</year>
          )
          <fpage>2057</fpage>
          -
          <lpage>2070</lpage>
          .
          <source>Learner's Dictionaries</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bianco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , W. Agata, [
          <volume>40</volume>
          ]
          <article-title>Thesaurus.com - The world's favorite online theet al</article-title>
          .,
          <source>Psychoeducative social robots for an healthier saurus!</source>
          ,
          <year>2022</year>
          .
          <article-title>lifestyle using artificial intelligence: a case-study</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3118</volume>
          ,
          <string-name>
            <surname>CEUR-WS</surname>
          </string-name>
          ,
          <year>2021</year>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>E.</given-names>
            <surname>Iacobelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Eyetracking system with low-end hardware: development and evaluation</article-title>
          ,
          <source>Information</source>
          <volume>14</volume>
          (
          <year>2023</year>
          )
          <fpage>644</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>E.</given-names>
            <surname>Iacobelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pelella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , et al.,
          <article-title>A fast and accessible neural network based eye-tracking system for real-time psychometric and hci applications</article-title>
          ,
          <source>in: CEUR WORKSHOP PROCEEDINGS</source>
          , volume
          <volume>3870</volume>
          ,
          <string-name>
            <surname>CEUR-WS</surname>
          </string-name>
          ,
          <year>2024</year>
          , pp.
          <fpage>32</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Puglisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          , et al.,
          <article-title>Exploiting robots as healthcare resources for epidemics management and support caregivers</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3686</volume>
          ,
          <string-name>
            <surname>CEUR-WS</surname>
          </string-name>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>N.</given-names>
            <surname>Boutarfaia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <article-title>Deep learning for eeg-based motor imagery classification: Towards enhanced human-machine interaction and assistive robotics</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3695</volume>
          ,
          <year>2023</year>
          , p.
          <fpage>68</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Q.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Building a Large Scale Dataset for Image Emotion Recognition: The Fine Print and The Benchmark</article-title>
          ,
          <source>Technical Report arXiv:1605.02677</source>
          , arXiv,
          <year>2016</year>
          . ArXiv:
          <volume>1605</volume>
          .02677 [
          <article-title>cs] type: article.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Mikels</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. L.</given-names>
            <surname>Fredrickson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. R.</given-names>
            <surname>Larkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Lindberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Maglio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Reuter-Lorenz</surname>
          </string-name>
          ,
          <article-title>Emotional category data on images from the international afective picture system</article-title>
          ,
          <source>Behavior Research Methods</source>
          <volume>37</volume>
          (
          <year>2005</year>
          )
          <fpage>626</fpage>
          -
          <lpage>630</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>