Approaches to sentiment analysis of the social network text data

Vadim Moshkin
Ulyanovsk State Technical University
Ulyanovsk, Russia
v.moshkin@ulstu.ru

Nadezhda Yarushkina
Ulyanovsk State Technical University
Ulyanovsk, Russia
jng@ulstu.ru

Ilya Andreev
Ulyanovsk State Technical University
Ulyanovsk, Russia
ia.andreev@ulstu.ru

Abstract: The article provides an overview of the most modern approaches to sentiment analysis of text data. The features of machine learning approaches and dictionary-based methods are also described. In addition, sentiment dictionaries and the most popular software for sentiment analysis of data are described. An original approach to sentiment analysis of text data is proposed, based on the integration of machine learning methods with the Word2Vec data vectorization algorithm. The architecture of the developed system for Opinion Mining of social network data is also presented. At the end of the article, experiments on evaluating text reviews, using data from the IMDB portal as an example, are presented; they confirm the proposed approach.

Keywords: sentiment analysis, word2vec, Opinion Mining, machine learning

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)
Data Science

I. INTRODUCTION

Currently, the Internet and social networks are the main sources of knowledge about a client's interests, which makes it possible to prepare and proactively offer the client a new product or service [1]. This problem is solved by Opinion Mining. Opinion Mining for data from social networks comprises two tasks:
• morphological analysis to identify the entities that will be evaluated;
• analysis of the sentiment of expressions related to these entities.

By analyzing the sentiment of users' text messages, the researcher can draw conclusions about:
• users' emotional evaluation of various events and objects;
• individual user preferences;
• some features of the users' nature [2].

Sentiment analysis is a section of text mining, a system for automatically extracting subjective opinions from text. Sentiment analysis determines the content of the text as much as its tonality.

Automatic analysis of the tonality of a text is based on the technologies of linguistic interpretation of emotions, machine learning, extraction of emotional meaning from information, etc.

The technology of sentiment analysis has become especially relevant with the development of Web 2.0, as a tool for monitoring the views of millions of Web users. However, text data in social networks have the following features:
• the use of both complete and incomplete sentences;
• the presence of speech and spelling errors;
• the use of smileys and emoji to give the message a certain emotional coloring.

In this article we consider the use of various existing algorithms for assessing the sentiment of social network texts within the framework of the developed software system for Opinion Mining. The article proposes an original approach to analyzing the emotional coloring of text data using the integration of machine learning methods with the Word2Vec algorithm.

II. THE EXISTING METHODS AND SOFTWARE FOR SENTIMENT ANALYSIS OF TEXT DATA

There are two main groups of methods for the automatic sentiment analysis of text data: statistical methods and methods based on dictionaries.

A. Statistical methods

These methods are based on a machine classifier. The classifier is first trained on pre-labeled texts. It then builds a model for analyzing new documents using the knowledge gained. The algorithm consists of the following steps:
• a collection of documents is gathered for training the classifier;
• each document is decomposed into a feature vector;
• the correct sentiment type is indicated for each document;
• the classification algorithm and the method for training the classifier are selected;
• the resulting model is used to determine the sentiment of the documents in a new collection.

The disadvantage of such methods is the need for a large amount of training data.

The statistical approach widely uses the support vector machine (SVM) [3], Bayesian models [4], various types of regression [5], the Word2Vec and Doc2Vec methods [6], CRF [7], and convolutional and recurrent neural networks [8][9].

Word2Vec. The Word2Vec method is based on the vector representation of words and on determining the semantic proximity of lexical units from their distribution in collections of texts on specific topics.

A large set of texts is input to Word2Vec. A specialized vocabulary is created and simultaneously learned from this set of texts. At the second stage, the vocabulary is turned into a set of vector representations of words. This representation is based on the contextual proximity of a given word: if words are found side by side in the text often enough, then there is a semantic connection between them, and therefore these words will have close coordinates in the vector representation.

Two training methods were developed for this algorithm: CBOW and Skip-gram. Schemes of these methods are presented in Fig. 1. The first method, CBOW, predicts the current word from its surrounding context words. The second method, Skip-gram, works the other way around: it predicts the surrounding words from the current word. The result of this method is the ability to calculate the "semantic distance" for each pair of words.

Fig. 1. CBOW and Skip-gram training methods.

Doc2Vec. The Doc2Vec method comprises two methods: distributed memory (DM) and distributed bag of words (DBOW). The DM method predicts a word from known prior words and a paragraph vector. The paragraph vector stays fixed while the context window moves through the text, which allows it to take the word order into account. DBOW predicts random groups of words in a paragraph based only on the paragraph vector.

A serious disadvantage of this method is the complexity of analyzing the training sample, which makes it extremely difficult to continuously update the model when new training data is received.

B. Methods based on dictionaries

The dictionary-based method searches the text for emotive vocabulary (lexical tonality) according to pre-compiled tonal dictionaries and rules, using linguistic analysis. These methods can use lists of rules that are substituted into regular expressions, as well as special rules for connecting tonal vocabulary within sentences [10].

Dictionary terms must have a weight corresponding to the subject area of the document in order to classify the document with high accuracy. The algorithm takes emotion into account whenever it finds a marker. The result of the algorithm is the average emotional coloring of the text [11-12].

The following algorithm is usually used:
• assign the sentiment score from the dictionary to each word in the text;
• calculate the overall sentiment score of the entire text by adding up the sentiment scores of the individual words [13].

The disadvantage of this method is that it is labor-intensive, because it requires the creation of many rules.

A mixed method is also sometimes used [14-16].

C. Dictionaries and thesauri

There are a number of thesauri labeled with regard to the emotional component. Computer programs need these dictionaries when analyzing the tonality of a text.

WordNet-Affect is a semantic thesaurus in which concepts related to emotions are represented using words that have an emotional component. WordNet-Affect also uses additional emotional labels to separate synsets according to their emotional valency. Four additional emotional labels are defined for this purpose:
• positive;
• negative;
• ambiguous;
• neutral.

SentiWordNet [17] is a lexical semantic thesaurus. The first version of SentiWordNet was developed in 2006. This thesaurus appeared as a result of the automatic annotation of each set of synonyms according to its degree of positivity, negativity and objectivity.

SenticNet is another semantic thesaurus for working with sets of emotional concepts. SenticNet is used to design intelligent applications that analyze the emotional component of text. The main purpose of SenticNet is to simplify the machine recognition of conceptual and emotional information transmitted in natural language. Compared with SenticNet, other lexical thesauri such as SentiWordNet and WordNet-Affect differ mainly in that they link words and emotional concepts at the syntactic level only, which does not reveal the semantic component.

D. Existing sentiment analysis software

Currently there is a certain set of libraries and software for sentiment analysis of text data.

Chorus is a service for determining the emotional coloring of email. This service was a startup developed by a company from Australia. Chorus is intended for customer support services:
• it recommends the next message to process;
• it indicates a message that needs an urgent response;
• it indicates where the client can be retained after the response.

The disadvantage is that it can analyze only emails. It is no longer supported.

Sentiment Analysis with Python NLTK Text Classification [18] is a demo showing the capabilities of NLTK. It divides the emotional coloring into positive, negative and neutral. An API with restrictions and the option to buy premium access is also offered. The demo is a form for manual verification with a limit on the number of characters.

SentiStrength [19] is a library for analyzing emotional coloring. The algorithm is based on searching for the maximum tonality value in the text on each scale (i.e., searching for the word with the maximum negative rating and the word with the maximum positive rating) [20]. As a result, a dual score (positive and negative) from 1 to 5 is given. There are also options for ternary and single-scale assessment of results. This library is paid. The library can be tried out on the project website.

Tone Analyzer [21] is a service from IBM based on IBM Watson. This service uses linguistic analysis to detect emotional and linguistic connotations in written text. Use cases for the analyzer include social listening, improving the quality of customer service, and integration with chat bots. This service is paid and supports only English.

III. SENTIMENT ANALYSIS USING MACHINE LEARNING AND WORD2VEC

The proposed method of text sentiment analysis combines clustering of word vectors with a Random Forest classifier. Schematically, the developed algorithm is presented in Fig. 2.

Fig. 2. Developed clustering algorithm.

1) Text data pre-processing is carried out at the first stage. HTML code, all non-alphabetic characters, and stop words are removed from the text. Stop words are phrases and words that do not carry a semantic load and make it difficult for search engines to index a page. All remaining words are then converted to lowercase.

2) At the second stage, the text from these files (test and training), presented as a list of significant words, is processed using the Word2Vec tool. Word2Vec is an open source tool provided by Google for calculating distances between words [22]. Word2Vec creates a model that includes a dictionary of words with their vector representations. Synonyms and similar words can be determined by the similarity of their vectors. 300-dimensional vectors were used to identify words most accurately. The resulting model is saved as a file.

3) The next stage is the clustering of the word vectors using the K-Means method in order to group synonyms and similar words. For the most accurate result, the number of clusters should be chosen so that there are 5 words per cluster on average.

After all the significant words have been split into clusters, the data must be prepared for machine learning. A two-dimensional array is created for each file as follows:
• the number of rows is equal to the number of text messages in the file;
• the number of columns is equal to the number of clusters.

This array is then used to determine the emotional coloring.

Fig. 3. Random Forest sentiment analysis of the text.

The Random Forest model (Fig. 3) was used for machine learning. The random forest method is currently one of the most popular and effective methods for solving machine learning problems such as classification and regression. It trains not a single decision tree but an ensemble of many decision trees [23].

The trained model is then used to make predictions and to calculate the accuracy of the algorithm.

IV. SOFTWARE ARCHITECTURE FOR OPINION MINING SOCIAL MEDIA

A module for assessing the tonality of texts was developed within the information system for Opinion Mining (Fig. 4) in order to evaluate the effectiveness of the proposed algorithms [24-26].

Fig. 4. The architectural scheme of the software system for Opinion Mining social media.

This information system solves the following tasks:
• extracts data from various social networks (Facebook, Ok, VKontakte, Instagram, Twitter, etc.);
• preprocesses the extracted data;
• matches (compares) user profile data from different social networks;
• translates the extracted data into an internal format for storing knowledge;
• conducts semantic analysis of the data using subject ontologies to simplify search;
• conducts sentiment analysis of the extracted data using the developed algorithm.

The developed software system for Opinion Mining has a service architecture and supports the REST architectural style. Elasticsearch is used to extract and preprocess data. MongoDB is used to store a large set of data. The Cypher query language is used to search the Neo4j graph database [27-28].

V. EXPERIMENT RESULTS

Experiments were conducted to determine the accuracy of estimating the emotional coloring of text data using the random forest method.

The test data is a data set from the IMDB site that contains 100,000 detailed film reviews (positive and negative). 1,500 reviews were set aside to verify accuracy. The maximum accuracy is 79%, because some reviews carry no emotional coloring and only retell the plots of films, which lowered the accuracy of the program.

Depending on the parameters, the running time of the algorithm ranged from 40 to 55 minutes. The experiments revealed the optimal parameter values of the algorithm: the dimension of the vectors, the number of clusters, and the minimum number of occurrences of a word in the reviews for the word to be considered significant.

The results of the experiments are presented in Table I and Fig. 5.

TABLE I. RESULTS OF EXPERIMENTS

Accuracy    Vector dimension    Number of clusters    Min. number of word occurrences
0.789       300                 3956                  60
0.776       200                 2301                  100
0.772       400                 2301                  100
0.771       300                 1573                  60

Fig. 5. Sentiment analysis using machine learning and Word2Vec.

The best result was obtained with 300-dimensional vectors, a minimum number of word occurrences equal to 60, and a number of clusters calculated so that each cluster had 5 words on average, i.e. 3956 clusters.

CONCLUSION

In this paper, an approach to the analysis of the text data of social networks was proposed. This approach is based on the integration of the Word2Vec vectorization algorithm and the K-Means clustering algorithm, with a random forest algorithm used to train the classifier. This approach was implemented in the Opinion Mining analysis system.

Experiments were conducted to evaluate the effectiveness of this algorithm when analyzing user reviews from the IMDB portal. The best result was obtained with 300-dimensional vectors, a minimum number of word occurrences of 60, and a number of clusters calculated so that each cluster had 5 words on average, i.e. 3956 clusters.

In future work, we plan to hybridize this approach with well-known sentiment ontologies and dictionaries in order to take into account the peculiarities of word usage and language.

ACKNOWLEDGMENT

This work was supported by the Russian Federal Property Fund. Projects No. 18-47-730035 and 18-47-732007.

REFERENCES

[1] O. Shipilov and A. Belyaev, “Analysis of the emotional color of messages in the social network Twitter,” Science Questions, vol. 3, pp. 91-98, 2016.
[2] D. Vlasov, “Description of the information image of a social network user, taking into account its psychological characteristics,” International Journal of Open Information Technologies, vol. 6, no. 4, 2018.
[3] M.S. Sabuj, Z. Afrin and K.M.A. Hasan, “Opinion Mining Using Vector Machine for Web Based Diverse Data,” Pattern Recognition and Machine Intelligence. Lecture Notes in Computer Science, vol. 10597, pp. 673-678, 2017.
[4] L.P. Dinu and I. Iuga, “The Best Feature of the Set,” Computational Linguistics and Intelligent Text Processing. Lecture Notes in Computer Science, vol. 7181, pp. 556-567, 2012.
[5] I. Chetviorkin and N. Loukachevitch, “Sentiment Analysis Track at ROMIP-2012,” Computational linguistics and intellectual technologies: Sat scientific articles, vol. 2, pp. 40-50, 2013.
[6] Q. Chen and M. Sokolova, “Word2Vec and Doc2Vec in Unsupervised Sentiment Analysis of Clinical Discharge Summaries,” CoRR, abs/1805.00352, 2018.
[7] A. Antonova and A. Soloviev, “Using the conditional random fields method for processing texts in Russian,” Computational linguistics and intellectual technologies: Sat scientific articles, vol. 12, no. 19, pp. 27-44, 2013.
[8] A. Maas, R. Daly, P. Pham, D. Huang, A. Ng and C. Potts, “Learning word vectors for sentiment analysis,” The International Language Technologies. International Association for Computational Linguistics, vol. 1, pp. 142-150, 2011.
[9] Yu.V. Vizilter, V.S. Gorbatsevich and S.Y. Zheltov, “Structure-functional analysis and synthesis of deep convolutional neural networks,” Computer Optics, vol. 43, no. 5, pp. 886-900, 2019. DOI: 10.18287/2412-6179-2019-43-5-886-900.
[10] H. Saif, “Contextual semantics for sentiment analysis of Twitter,” Information Processing & Management, vol. 52, no. 1, pp. 5-19, 2016.
[11] A. Pak and P. Paroubek, “Twitter as a Corpus for Sentiment Analysis and Opinion Mining,” LREC, 2010.
[12] A. Tarasova, “Synergy of interrogative and exclamation marks in network texts (on the material of Tatar, Russian and English languages),” Bulletin of Vyatka State University, vol. 4, 2015.
[13] S. Ionova, “Emotiveness of a Text as a Linguistic Problem,” Diss .... Cand. filol. Sciences, 1998.
[14] B. Pang, L. Lee and S. Vaithyanathan, “Thumbs up? Sentiment Classification using Machine Learning Techniques,” pp. 79-86, 2002.
[15] P. Turney, “Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews,” Proceedings of the Association for Computational Linguistics, pp. 417-424, 2002.
[16] I.A. Rycarev, D.V. Kirsh and A.V. Kupriyanov, “Clustering of media content from social networks using BigData technology,” Computer Optics, vol. 42, no. 5, pp. 921-927, 2018. DOI: 10.18287/2412-6179-2018-42-5-921-927.
[17] V. Moshkin, N. Yarushkina and I. Andreev, “The Sentiment Analysis of Unstructured Social Network Data Using the Extended Ontology SentiWordNet,” 12th International Conference on Developments in eSystems Engineering (DeSE), Kazan, Russia, pp. 576-580, 2019.
[18] Natural Language Processing APIs and Python NLTK Demos [Online]. URL: https://text-processing.com/demo/sentiment/.
[19] A. Esuli and F. Sebastiani, “SENTIWORDNET: A Guide for Respecting the Opinion Mining,” pp. 417-422, 2006.
[20] SentiStrength [Online]. URL: http://sentistrength.wlv.ac.uk/.
[21] I. Menshikov and A. Kudryavtsev, “Review of systems for analysis of tonality of a text in Russian,” Young scientist, no. 12, pp. 140-143, 2012.
[22] Watson Tone Analyzer [Online]. URL: https://www.ibm.com/cloud/watson-tone-analyzer.
[23] Introduction to Word Embedding and Word2Vec [Online]. URL: https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa.
[24] J. Žižka, F. Dařena and A. Svoboda, “Random Forest,” 2019. DOI: 10.1201/9780429469275-8.
[25] N. Yarushkina, A. Filippov, M. Grigoricheva and V. Moshkin, “The Method for Improving the Quality of Information Retrieval Based on Linguistic Analysis of Search Query,” Artificial Intelligence and Soft Computing. Lecture Notes in Computer Science, vol. 11509, pp. 474-485, 2019.
[26] A. Pazelskaya and A. Soloviev, “Method for determining emotions in texts in Russian,” Computational linguistics and intellectual technologies: Sat scientific articles, vol. 11, no. 18, pp. 510-523, 2011.
[27] A. Filippov, V. Moshkin and N. Yarushkina, “Development of the Social Media Analysis,” Recent Research in Control Engineering and Decision Making. Studies in Systems, Decision and Control, vol. 199, pp. 421-432, 2019.
[28] N. Yarushkina, A. Filippov, V. Moshkin, G. Guskov and A. Romanov, “Intelligent Instrumentation for Opinion Mining in Social Media,” Proceedings of the II International Scientific and Practical Conference Fuzzy Technologies in the Industry, Ulyanovsk, Russia, pp. 50-55, 2018.
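
As an illustration of the two-dimensional array described in Section III (one row per text message, one column per cluster), the following minimal Python sketch may help. It is not the authors' implementation. The function name bag_of_clusters, the word_to_cluster mapping (a hand-written stand-in for the K-Means assignment of Word2Vec vectors) and the choice of word counts as cell values are all assumptions made for illustration; the paper specifies only the shape of the array.

```python
from collections import Counter


def bag_of_clusters(messages, word_to_cluster, n_clusters):
    """Build one feature row per message and one column per cluster.

    Assumption: each cell holds the number of words in the message that
    belong to the corresponding cluster. Words without a cluster
    assignment (e.g. words that were too rare) are ignored.
    """
    rows = []
    for words in messages:
        counts = Counter(
            word_to_cluster[w] for w in words if w in word_to_cluster
        )
        rows.append([counts.get(c, 0) for c in range(n_clusters)])
    return rows


# Hypothetical cluster assignments; in the described pipeline these would
# come from K-Means applied to the Word2Vec vectors.
word_to_cluster = {
    "great": 0, "excellent": 0,
    "awful": 1, "terrible": 1,
    "plot": 2,
}

# Pre-processed messages: lowercase lists of significant words.
messages = [
    ["great", "plot", "excellent"],
    ["awful", "terrible", "plot", "unknownword"],
]

features = bag_of_clusters(messages, word_to_cluster, n_clusters=3)
print(features)  # [[2, 0, 1], [0, 2, 1]]
```

Rows built this way, together with the positive or negative label of each review, could then be passed to a classifier such as a random forest. Using counts rather than binary indicators keeps information about repeated words, which is one possible design choice among several.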