=Paper=
{{Paper
|id=Vol-2667/paper46
|storemode=property
|title=An approach to the training dataset formation for assessing the sentiment degree of social network posts using machine learning
|pdfUrl=https://ceur-ws.org/Vol-2667/paper46.pdf
|volume=Vol-2667
|authors=Andrey Konstantinov
}}
==An approach to the training dataset formation for assessing the sentiment degree of social network posts using machine learning ==
Andrey Konstantinov, Ulyanovsk State Technical University, Ulyanovsk, Russia, adwaises@mail.ru

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: This article describes an approach to forming a training set for assessing the emotional coloring of social network posts. The dataset is formed in an automated mode. The input of the algorithm is 2.5 million posts of a social network; the output is a training set for a neural network. The algorithm selects posts using author's symbols for expressing emotions and key phrases. The quality of the training set is checked by training a multilayer perceptron on the resulting set and running experiments. The accuracy with which the neural network determines the emotional coloring of social network posts is about 67%.

Keywords: data analysis, sentiment analysis, natural language processing, social network

===I. INTRODUCTION===
The study of social networks is becoming increasingly important every year due to the growing need to ensure public safety and monitor public sentiment. An analysis of posts can help assess changes in the mood of many users and find application in political and social studies, including consumer research.

Currently, neural networks are used to solve various problems in the field of intelligent data processing. The deployment of a neural network is carried out in two stages [1]:
# Choice of the neural network architecture.
# Creation of a training dataset.

Preparing the training dataset takes a lot of time. In many cases, an expert analyzes the data and builds the training dataset manually.

The purpose of this work is to develop an experimental model of a software system for determining the emotional coloring of posts on a social network based on author's symbols for expressing emotions.

The main tasks are presented below.
* Analysis of the subject area, which includes determining the source data for the formation of the training dataset and the classes of emotional coloring of posts;
* A review of existing solutions and studies proposed by Russian researchers;
* Development of a methodology for the formation of a training dataset, based on methods of linguistic analysis of text;
* Software implementation;
* Experiments that show how effectively a trained neural network determines the emotional coloring of a post.

===II. ANALOGS===
The works of Russian researchers currently offer various methods for the formation of training datasets.

The first method is described in [2]. Its essence is to minimize the training dataset while still fully describing the behavior of the model. To reduce the amount of experimental data needed to train a neural network, it is proposed to synthesize the missing training pairs from a previously constructed mathematical model.

In [3], several methods for the formation of a training dataset are described.
* The first is software generation. The essence of the method is to vary as many parameters as possible during the sampling process.
* The second is sampling. The essence of the method is to set a distribution in the space of objects. This method is used to examine only the meaningful parts of the data rather than all of it.
* The third is natural modification of a base object. The training set is obtained by modifying the object's parameters.
* The fourth is fetching from a database of objects. The idea is to group the objects so that objects within a group are closer to each other than to objects of other groups.

In [4], a research prototype of a text tonality analyzer is described, which implements a step-by-step process of text processing. At the first stage, the text is divided into sentences, and the sentences into words. At the second stage, morphological analysis of each word, lemmatization, and determination of parts of speech are performed. These stages are necessary for exact matching of the words found against the tonal dictionary. A tonal dictionary of about 35,000 words is used for Russian-language text. In the dictionary, each word corresponds to a tonal score, which is a set of five values. Each value determines the degree to which a word belongs to one of the classes: extremely negative, negative, neutral, positive, extremely positive.

Software systems and modules that perform sentiment analysis of texts were also considered. The SentiFinder module [5] defines three types of tonality of Russian-language texts: positive, negative, and neutral. Tonality is defined relative to a given tonality object within a single sentence or throughout a document. The average accuracy for the three types of tonality is about 87%.

Some thesauruses are specifically annotated with an emotional component. Such dictionaries are needed by programs that analyze the tonality of text. WordNet-Affect is a semantic thesaurus in which concepts are associated with emotions and are represented using words with an emotional component [6]. WordNet-Affect also uses additional emotional labels to separate synsets according to their emotional valency. Four such labels are defined: positive, negative, ambiguous, and neutral.

SentiWordNet is a lexical-semantic thesaurus, the first version of which was developed in 2006 [7]. It is the result of automatically annotating sets of synonyms by their degree of positivity, negativity, and objectivity. Using SentiWordNet provides more than a 20% increase in accuracy compared to its first version [8].

SenticNet is another semantic thesaurus for working with sets of emotional concepts [9]. SenticNet is used to design intelligent applications for analyzing the emotional component of text. Its main purpose is to simplify machine recognition of the conceptual and emotional information conveyed in natural language. The main difference between the considered thesauruses is that SentiWordNet and WordNet-Affect link words and emotional concepts at the syntactic level and do not reveal the semantic component.

The works considered give only general recommendations for the formation of a training dataset and do not provide methods or algorithms for forming a high-quality training dataset for sentiment analysis in an automated mode. The knowledge gathered from these studies can nevertheless be used in this work.

===III. MODELS AND ALGORITHMS===
The most popular method for creating a training dataset is selection by keywords and phrases. This work uses dictionaries of author's symbols for expressing emotions and dictionaries of key phrases. The dictionaries of author's symbols were compiled by an expert. Each dictionary is compiled for a specific emotion and contains several author's symbols for expressing that emotion. The dictionaries of key phrases were found on the Internet and supplemented by analyzing posts on the social network.

At the first stage, posts are selected based on the dictionaries of author's symbols for expressing emotions. The input is 2.5 million posts taken from the database. If a post contains an author's symbol for expressing an emotion, it is assigned to the corresponding class and added to the training list.

At the second stage, posts are selected based on the dictionaries of key phrases. The input is the lists produced at the previous stage. Each word of the post is lemmatized, and the post is then checked against each entry of the dictionary. If the post contains a key phrase, it is assigned to the corresponding class of emotional coloring. The output is written to text files, each of which contains the training dataset for one class of emotional coloring.

A neural network only works with vectors, so texts must be represented in vector form. To represent the training dataset as vectors, the word2vec algorithm was used [10]. First, a list of all the words in the posts is compiled; beforehand, all words are reduced to their initial form by lemmatization. Then vectors are created whose size is equal to the size of the list of all words. Each component of the vector is set to 1 if the corresponding word occurs in the post and to 0 otherwise.

A multilayer perceptron with three layers was used as the neural network. The number of neurons in the first layer is equal to the size of the list of all dictionary words. The number of neurons in the second layer is equal to the size of the first divided by 50. The size of the second layer was selected by conducting many experiments. For a dictionary of 2000 words, the size of the second layer will be 400 neurons. The number of neurons in the third layer is equal to seven, since seven emotions need to be determined.

After the neural network is trained, a test set is fed to it. Each post of the test set is also transformed into a vector based on the dictionary obtained during training.

====A. Formal Description of the System====
Formally, the process of selecting posts can be represented by the flowchart in Figure 1. The flowchart describes the process of selecting posts for the formation of a training dataset. Each stage of the selection contains the processes of selecting posts for each specific emotion.

Fig. 1. Post selection process (flowchart: Start → selection of posts based on dictionaries of author's symbols for expressing emotions → selection of posts based on dictionaries of key phrases → End).

At the first stage, posts are selected based on the dictionaries of author's symbols for expressing emotions for each class of emotional coloring of the text. At the second stage, posts are selected based on the dictionaries of key phrases. At the third stage, posts shorter than a specified length are selected. The length restriction was introduced because the accuracy with which the neural network recognizes the emotional coloring of the text decreases on long posts [11].

Formally, the set of dictionaries by which posts are selected can be represented by formula (1):

: <math>D = \{D_E, D_W\}</math> (1)

where <math>D_E</math> is the set of dictionaries of author's symbols for expressing emotions and <math>D_W</math> is the set of dictionaries of keywords and phrases.

In turn, the set of dictionaries of author's symbols for expressing emotions can be represented by formula (2):

: <math>D_E = \{D_E^{joy}, D_E^{sad}, D_E^{surp}, D_E^{anger}, D_E^{disg}, D_E^{cont}, D_E^{fear}\}</math> (2)

where <math>D_E^{joy}</math> is the dictionary for the emotion «joy», <math>D_E^{sad}</math> for «sad», <math>D_E^{surp}</math> for «surprise», <math>D_E^{anger}</math> for «anger», <math>D_E^{disg}</math> for «disgust», <math>D_E^{cont}</math> for «contempt», and <math>D_E^{fear}</math> for «fear».

Likewise, the set of dictionaries of keywords can be represented by formula (3):

: <math>D_W = \{D_W^{joy}, D_W^{sad}, D_W^{surp}, D_W^{anger}, D_W^{disg}, D_W^{cont}, D_W^{fear}\}</math> (3)

where <math>D_W^{joy}</math> is the dictionary for the emotion «joy», <math>D_W^{sad}</math> for «sad», <math>D_W^{surp}</math> for «surprise», <math>D_W^{anger}</math> for «anger», <math>D_W^{disg}</math> for «disgust», <math>D_W^{cont}</math> for «contempt», and <math>D_W^{fear}</math> for «fear».

Each process of selecting posts for a specific emotion is associated with one dictionary of author's symbols from <math>D_E</math> and one dictionary of key phrases from <math>D_W</math>.

The process of testing the training dataset can be represented by the flowchart in Figure 2.

Fig. 2. Learning set validation process (flowchart: Start → creation of the vectors of the text → neural network training → assessing the accuracy of sentiment analysis of a text → Finish).

At the first stage, a set of vectors is formed using the word2vec algorithm. Next, the neural network is trained. Finally, the accuracy of determining the emotional coloring of the text is assessed on a test set.

===IV. SOFTWARE IMPLEMENTATION===
To evaluate the effectiveness of the developed approach to the formation of the training dataset, a software system was implemented.

The system reads posts from the database, reads the dictionaries of author's symbols and key phrases for each emotion, performs lemmatization, forms the training dataset, and trains the neural network.

First, the dictionaries of author's symbols for expressing emotions are read and posts are selected. After that, the dictionaries of key phrases are read, the posts are lemmatized, and posts containing key phrases are selected. The selected posts are then saved in text files. Once the training dataset has been formed, the neural network is trained and the accuracy with which it determines the emotional coloring of posts is tested.

The following libraries were used to build the software system.
* Lucene Russian Morphology is a morphological analysis library [12]. It performs morphological analysis of a word, lemmatizes the source word in Russian, and returns information about its part of speech. It uses dictionary-based morphology with some heuristics for unknown words and supports homonyms.
* Encog Machine Learning Framework is a machine learning library [13]. It supports various learning algorithms, and its main strength is its neural network algorithms. The library contains classes for creating a wide range of networks, as well as classes for normalizing and processing data for these networks. Multithreading is used to provide good training performance on multicore machines.
* PostgreSQL JDBC Driver is a library that provides access to a PostgreSQL database [14]. It establishes the connection to the database and handles interaction with it. As parameters, the library accepts the database address and port, login, and password for the connection. The library then takes SQL queries as input and returns the resulting data.

===V. EXPERIMENTS===
The quality of the generated training dataset is evaluated as the accuracy with which a neural network trained on it determines the emotional coloring of text.

Two parameters were varied in the experiments: the number of posts in the training set and the method of text processing (stemming or lemmatization). Accuracy was measured on test posts, each of which belongs to one category.

The quality of the training dataset is defined as the number of correct conclusions divided by the number of test posts. The experimental results are shown in Table 1.

{| class="wikitable"
|+ TABLE I. STEMMING AND LEMMATIZATION EXPERIMENTS
! Number of posts !! Stemming !! Lemmatization
|-
| 20 || 4/7 || 6/7
|-
| 50 || 6/7 || 7/7
|-
| 100 || 4/7 || 7/7
|-
| 200 || 4/7 || 7/7
|-
| 300 || 5/7 || 7/7
|}

The experiments show that a training dataset formed with lemmatization is better than one formed with stemming. Table 1 shows that the accuracy with which the neural network recognizes posts is much higher when the training dataset is formed using lemmatization. The experimental results are also presented as a graph in Figure 3.

Fig. 3. Stemming and lemmatization.

Additionally, 1,400 posts were submitted to the neural network, 200 posts from each class. The experimental results are presented in Table 2.

{| class="wikitable"
|+ TABLE II. EXPERIMENT RESULTS
! Emotion !! Total !! Correct (+) !! Incorrect (-)
|-
| Joy || 200 || 148 || 52
|-
| Sad || 200 || 154 || 46
|-
| Anger || 200 || 110 || 90
|-
| Surprise || 200 || 126 || 74
|-
| Fear || 200 || 101 || 99
|-
| Disgust || 200 || 151 || 49
|-
| Contempt || 200 || 121 || 79
|-
| Sum: || 1400 || 936 || 464
|-
| Percent: || || 0.669 || 0.331
|}

The experiments show that the neural network correctly recognizes the emotion with an accuracy of 67%. The neural network is best at determining joy, sadness, and disgust, with an accuracy of about 75%. The results of the experiment are also presented as a graph in Figure 4.

Fig. 4. Experiment results.

===VI. CONCLUSION===
As a result of this work, an expert system was developed to determine the emotional coloring of social network posts. The training dataset is created in an automated mode using dictionaries of author's symbols for expressing emotions and dictionaries of key phrases.

The neural network correctly determines the class of emotional coloring of a post with an accuracy of 67%. It recognizes the emotions of joy, sadness, and disgust with an accuracy of about 75%.

In the future, it is planned to improve the training dataset generation algorithm. The compiled dictionaries will be expanded and updated. To test the dataset, neural networks of various architectures, for example deep networks, will be used.

===ACKNOWLEDGMENT===
This work was supported by the Russian Foundation for Basic Research, projects No. 18-47-730035 and 18-47-732007.

===REFERENCES===
[1] Yu.V. Vizilter, V.S. Gorbatsevich and S.Y. Zheltov, “Structure-functional analysis and synthesis of deep convolutional neural networks,” Computer Optics, vol. 43, no. 5, pp. 886-900, 2019. DOI: 10.18287/2412-6179-2019-43-5-886-900.

[2] D.A. Grishelenok and A.A. Kovel, “Using the results of mathematical planning of an experiment in the formation of a training dataset of a neural network,” Krasnoyarsk: SibSAU, 2010.

[3] I.L. Kaftannikov and A.V. Parasich, “Problems of forming a training dataset in machine learning problems,” Bulletin of SUSU. Series Computer technology, control, electronics, vol. 16, no. 3, pp. 15-24, 2016.

[4] R.V. Posevkin and I.A. Bessmertny, “The use of sentiment analysis of texts to assess public opinion,” Scientific and Technical Journal of Information Technologies, Mechanics, and Optics, vol. 15, no. 1, pp. 169-171, 2015.

[5] SentiFinder module [Online]. URL: eurekaengine.ru

[6] Thesaurus WordNet [Online]. URL: http://wndomains.fbk.eu/wnaffect.html

[7] V. Moshkin, N. Yarushkina and I. Andreev, “The Sentiment Analysis of Unstructured Social Network Data Using the Extended Ontology SentiWordNet,” IEEE 12th International Conference on Developments in eSystems Engineering (DeSE), Kazan, Russia, pp. 576-580, 2019. DOI: 10.1109/DeSE.2019.00110.

[8] Thesaurus SentiWordNet [Online]. URL: http://sentiwordnet.isti.cnr.it

[9] SenticNet Thesaurus [Online]. URL: https://sentic.net

[10] Word2Vec Algorithm [Online]. URL: https://neurohive.io/ru/

[11] I.A. Rycarev, D.V. Kirsh and A.V. Kupriyanov, “Clustering of media content from social networks using BigData technology,” Computer Optics, vol. 42, no. 5, pp. 921-927, 2018. DOI: 10.18287/2412-6179-2018-42-5-921-927.

[12] Library of morphological processing Russian Morphology: Russian [Online]. URL: https://github.com/AKuznetsov/russianmorphology

[13] Neural network library Encog Machine Learning Framework [Online]. URL: https://www.heatonresearch.com/encog/

[14] PostgreSQL JDBC Driver Database Access Library [Online]. URL: https://jdbc.postgresql.org/
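
The three selection stages described in Section III (author's symbols, key phrases, length limit) can be sketched roughly as follows. This is an illustration only, not the system from the paper: the dictionary contents, the length limit of 80 characters, and the function names are hypothetical, and `normalize` is a crude stand-in for real lemmatization (the paper uses Lucene Russian Morphology). It also assumes that the second stage filters the lists produced by the first stage, which is one reading of the description.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical miniature dictionaries standing in for D_E and D_W;
# the paper does not list the real entries.
SYMBOLS: Dict[str, List[str]] = {
    "joy": [":)", ":D"],
    "sad": [":(", ":'("],
}
KEY_PHRASES: Dict[str, List[str]] = {
    "joy": ["great day", "be happy"],
    "sad": ["miss you", "feel lonely"],
}
MAX_LENGTH = 80  # assumed limit in characters; the paper gives no value


def normalize(text: str) -> str:
    """Stand-in for lemmatization: lowercase and keep word characters only."""
    return " ".join(re.findall(r"\w+", text.lower()))


def select_posts(
    posts: Iterable[str],
    symbols: Dict[str, List[str]],
    key_phrases: Dict[str, List[str]],
    max_length: int,
) -> Dict[str, List[str]]:
    """Return, for each emotion, the posts that pass all three stages."""
    selected: Dict[str, List[str]] = {emotion: [] for emotion in symbols}
    for post in posts:
        padded = f" {normalize(post)} "
        for emotion, marks in symbols.items():
            # Stage 1: the post contains an author's symbol of this emotion.
            if not any(mark in post for mark in marks):
                continue
            # Stage 2: the normalized post contains a key phrase of the
            # same emotion (matched on whole words).
            phrases = key_phrases.get(emotion, [])
            if not any(f" {phrase} " in padded for phrase in phrases):
                continue
            # Stage 3: the post is shorter than the length limit.
            if len(post) < max_length:
                selected[emotion].append(post)
    return selected


if __name__ == "__main__":
    sample = ["Great day at the lake :)", "I miss you :(", "Great day, no symbol"]
    for emotion, chosen in select_posts(sample, SYMBOLS, KEY_PHRASES, MAX_LENGTH).items():
        print(emotion, chosen)
```

Keeping one list per emotion mirrors the per-class text files that the paper writes as output; a post could in principle land in more than one list if it carries symbols and phrases of several emotions.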
VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020) 214
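
The non-training steps of the validation process in Figure 2 can be sketched in a few lines: the 0/1 vector encoding described in Section III, the layer sizes of the perceptron, and the accuracy measure from Section V. This is a minimal sketch under stated assumptions, with hypothetical function names. The encoding shown is a binary word-presence vector, which is what the text describes even though it attributes it to word2vec. The text gives the hidden layer as the input size divided by 50 but also quotes 400 neurons for 2000 words (a divisor of 5), so the divisor is left as a parameter, and one output neuron per emotion class is assumed.

```python
from typing import List, Sequence, Tuple


def build_vocabulary(posts: Sequence[Sequence[str]]) -> List[str]:
    """Collect every distinct lemma from the lemmatized posts, sorted."""
    return sorted({lemma for post in posts for lemma in post})


def encode(post: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Return 1 where the vocabulary word occurs in the post, else 0."""
    present = set(post)
    return [1 if word in present else 0 for word in vocabulary]


def layer_sizes(vocab_size: int, divisor: int, classes: int = 7) -> Tuple[int, int, int]:
    """Input, hidden and output layer sizes; divisor and classes are assumptions."""
    return vocab_size, vocab_size // divisor, classes


def accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Number of correct conclusions divided by the number of test posts."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


if __name__ == "__main__":
    train = [["good", "day"], ["bad", "day"]]
    vocab = build_vocabulary(train)
    print(vocab)                                      # ['bad', 'day', 'good']
    print(encode(["good", "good", "night"], vocab))   # [0, 0, 1]
    print(layer_sizes(2000, 50), layer_sizes(2000, 5))
```

Words in a test post that are missing from the training vocabulary (such as "night" above) are simply ignored, which matches the statement that test posts are encoded with the dictionary obtained during training.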