=Paper=
{{Paper
|id=Vol-2667/paper46
|storemode=property
|title=An approach to the training dataset formation for assessing the sentiment degree of social network posts using machine learning 
|pdfUrl=https://ceur-ws.org/Vol-2667/paper46.pdf
|volume=Vol-2667
|authors=Andrey Konstantinov
}}
==An approach to the training dataset formation for assessing the sentiment degree of social network posts using machine learning ==
<pdf width="1500px">https://ceur-ws.org/Vol-2667/paper46.pdf</pdf>
<pre>
     An approach to the training dataset formation for
     assessing the sentiment degree of social network
               posts using machine learning
                                                             Andrey Konstantinov
                                                      Ulyanovsk State Technical University
                                                              Ulyanovsk, Russia
                                                              adwaises@mail.ru

    Abstract—This article describes an approach to the                                 • Conducting experiments that show the effectiveness
formation of a training set for assessing the emotional coloring                         of determining the emotional coloring of a post with
of social network posts. The dataset is formed in an automated                           a trained neural network.
mode. The input of the algorithm is 2.5 million posts of
a social network; the output is a training set for a neural                                           II. ANALOGS
network. The algorithm for the formation of the training set is
based on selection using copyright symbols for expressing
                                                                                Currently, the works of Russian researchers offer various
emotions and key phrases. The quality of the training set is                 methods for the formation of training datasets.
checked by training a multilayer perceptron on the obtained                     The first method of generating training datasets is
set and running experiments. The accuracy of determining                     described in [2]. The essence of the method is to minimize
the emotional coloring of posts of a social network by a neural              the training dataset. The training dataset should fully
network is about 67%.                                                        describe the behavior of the model. To minimize the amount
                                                                             of experimental data when training a neural network, it is
   Keywords—data analysis, sentiment analysis, natural
language processing, social network
                                                                             proposed to synthesize the missing training pairs from a
                                                                             previously constructed mathematical model.
                        I. INTRODUCTION                                          In [3], several methods for the formation of a training
    The study of social networks is becoming increasingly                    dataset are described. The first way is the software
important every year due to the growing need to ensure                       generation. The essence of the method is to vary as many
public safety and monitor public sentiment. An analysis of                   parameters as possible during the sampling process.
posts can help assess changes in the mood of many users and
                                                                                 The second way is sampling. The essence of the method
find application in political and social studies, including
                                                                             is to set the distribution in the space of objects. This method
consumer research.
                                                                             is used to examine not all data, but only meaningful parts.
   Currently, neural networks are used to solve various
                                                                                The next method is the natural modification of the base
problems in the field of intelligent data processing. The
                                                                             object. The training set is obtained by modifying the
deployment of a neural network is carried out in two stages.
                                                                             parameters.
        • Choice of neural network architecture.
                                                                                 The fourth method is fetching from a database of objects.
        • Creation of a training dataset [1].                                The idea is to group objects so that the objects within a
                                                                             group are closer to each other than to the objects of other
    Preparing the training dataset is time-consuming. In many                groups.
cases, an expert analyzes the data and builds the training
dataset manually.                                                                 In [4], a research prototype of a text tonality analyzer is
                                                                             described, which implements a step-by-step process of text
   The purpose of this work is to develop an experimental                    processing. At the first stage, the text is divided into separate
model of a software system for determining the emotional                     sentences, and sentences into separate words. At the second
coloring of posts on a social network based on copyright                     stage, a morphological analysis of each word, lemmatization
symbols for expressing emotions.                                             and determination of parts of speech are performed. The
    The main tasks are presented below.                                      listed stages of the sentence analysis are necessary for the
                                                                             exact matching of the words found to the tonal dictionary.
         • Analysis of the subject area, which includes the                  Tonal dictionaries are used for Russian-language text with a
          determination of the source data for the formation                 volume of about 35,000 words. In the dictionary, each word
          of the training dataset and classes of emotional                   corresponds to a tonal score. This indicator is a set of five
          coloring of posts;                                                 values. Each value determines the degree to which a word
                                                                             belongs to one of the classes: extremely negative, negative,
         • A review of existing solutions and studies that were
                                                                             neutral, positive, extremely positive.
          proposed by Russian researchers;
                                                                                 Also in the course of work, software systems and
         • Development of a methodology for the formation of
                                                                             modules that perform sentiment analysis of texts were
          a training dataset, which is based on the methods of
                                                                             considered. The SentiFinder module [5] defines three types
          linguistic analysis of text information;
                                                                             of tonality of Russian-language texts: positive, negative and
         • Software implementation;                                          neutral. Tonality is defined relative to a given tonality object
                                                                             within a single sentence or throughout a document. The
                                                                             average accuracy for the three types of tonality is about 87%.


Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
Data Science

    Some thesauruses are specifically marked up to take                       A neural network only works with vectors, so texts must
the emotional component into account. Such dictionaries                   be represented in vector form. To represent the training
are needed by computer programs that analyze the                          dataset in the form of vectors, the word2vec algorithm was
tonality of a text. WordNet-Affect is a                                   used [10]. Initially, a list of all the words in the posts is
semantic thesaurus in which concepts are associated with                  compiled. Beforehand, all words are reduced to their initial
emotions and are represented using words with an emotional                form using lemmatization. Then, vectors are created whose
component [6]. WordNet-Affect also uses additional                        size is equal to the size of the list of all words. Each vector
emotional labels to separate synsets according to their                   element is set to 1 if the corresponding word occurs in the
emotional valency. To do this, four additional emotional                  post, and to 0 otherwise.
labels are defined: positive, negative, ambiguous, and
neutral.                                                                      A multilayer perceptron with three layers was used as a
                                                                          neural network. The number of neurons in the first layer is
    SentiWordNet is a lexical-semantic thesaurus, the first               equal to the size of the list of all dictionary words. The
version of which was developed in 2006 [7]. This system is                number of neurons in the second layer is equal to the size of
the result of the process of automatic annotation of a set of             the first divided by 50. The size of the second layer was
synonyms by its degree of positivity, negativity and                      selected by conducting many experiments. For a dictionary
objectivity. Using SentiWordNet provides more than a 20%                  of 2000 words, the size of the second layer will be 400
increase in accuracy compared to the first version [8].                   neurons. The number of neurons of the third layer is equal to
                                                                          seven since we need to determine seven emotions.
    SenticNet is another semantic thesaurus for working with
sets of emotional concepts [9]. SenticNet is used to design                   After training the neural network, a test set is input. Each
intelligent applications for analyzing the emotional                      post of the set is also transformed into a vector based on the
component of the text. The main purpose of SenticNet is to                dictionary that was obtained during the training of the neural
simplify the process of machine recognition of conceptual                 network.
and emotional information that is transmitted using natural
language. The main difference between the considered                      A. Formal Description of the System
thesauruses is that SentiWordNet and WordNet-Affect                           Formally, the process of selecting posts can be
provide the linking of words and emotional concepts at the                represented by a flowchart in Figure 1. The flowchart
syntactic level and do not reveal the semantic                            describes the process of selecting posts for the formation of a
component.                                                                training dataset. Each stage of the selection contains the
                                                                          processes of selecting posts for each specific emotion.
    The scientific works considered give only general
recommendations for the formation of the training dataset
but do not provide methods or algorithms that would allow                                                   Start
the formation of a high-quality training dataset for sentiment
analysis in an automated mode. The knowledge gained from
reviewing this research can be used in the present                                    Selection of posts based on dictionaries of
work.                                                                                 copyright symbols for expressing emotions
                III. MODELS AND ALGORITHMS
    The most popular method for creating a training dataset                              Selection of posts based on dictionaries of
is the selection by keywords and phrases. When using this                                               key phrases
method, dictionaries of copyright symbols of expression of
emotions and dictionaries of key phrases are used.
   Dictionaries of copyright symbols of expression of                                                       End
emotions were compiled by an expert. Each dictionary is
compiled for a specific emotion and contains several                      Fig. 1. Post selection process.
copyright symbols for expressing emotions. Dictionaries of
key phrases were found on the Internet and supplemented by                    At the first stage, posts are selected based on dictionaries
analyzing posts on the social network.                                    of copyright symbols of expression of emotions for each
    At the first stage, posts are selected based on dictionaries          class of emotional coloring of the text. In the second stage,
of copyright symbols for expressing emotions. As input                    posts are selected based on dictionaries of key phrases. In the
information, 2.5 million posts from the database are taken. If            third stage, posts are selected whose length is less than the
a post contains an author’s symbol for expressing emotions,               specified length. A length restriction was introduced because
then it belongs to a specific class and is added to the                   training the neural network on large posts reduces the
corresponding list.                                                       accuracy of recognition of the emotional coloring of the text
                                                                          [11].
    In the second stage, posts are selected based on
dictionaries of key phrases. The input information is the lists              Formally, the set of dictionaries by which posts are selected
that were received at the previous stage. At this stage, each             can be represented by formula (1)
word of the post is lemmatized. Then the post is checked for                                          D = {DE, DW}                     (1)
the presence of each phrase from the dictionary.
If the post contains a phrase, then it belongs to a specific              where DE is a set of dictionaries with copyright symbols for
class of emotional coloring. At the output, the data is written           expressing emotions and DW is a set of dictionaries with
into text files, each of which contains a training dataset of a           keywords and phrases.
particular class of emotional coloring.
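
The selection stages described in this section can be sketched in Python as
follows. This is a minimal sketch under stated assumptions, not the author's
implementation: the dictionaries EMOTICON_DICTS and PHRASE_DICTS, all function
names, the lowercase-only lemmatize stand-in and the max_len value are
hypothetical. The system in the paper uses expert-compiled dictionaries for
seven emotions and a morphological library for Russian.

```python
import re

# Hypothetical dictionaries for two of the seven emotion classes.
EMOTICON_DICTS = {"joy": [":)", ":D"], "sad": [":(", ":'("]}
PHRASE_DICTS = {"joy": ["happy", "great day"], "sad": ["miss you", "cry"]}


def lemmatize(word):
    """Stand-in lemmatizer (assumption): lowercases only."""
    return word.lower()


def select_by_emoticons(posts, emoticon_dicts):
    """Stage 1: add a post to the list of each emotion whose symbol it contains."""
    selected = {emotion: [] for emotion in emoticon_dicts}
    for post in posts:
        for emotion, symbols in emoticon_dicts.items():
            if any(symbol in post for symbol in symbols):
                selected[emotion].append(post)
    return selected


def select_by_phrases(selected, phrase_dicts, max_len=200):
    """Stages 2 and 3: keep posts that contain a key phrase of their class
    and are shorter than max_len characters."""
    result = {}
    for emotion, posts in selected.items():
        phrases = phrase_dicts.get(emotion, [])
        kept = []
        for post in posts:
            if len(post) >= max_len:
                continue
            lemmas = " ".join(lemmatize(w) for w in re.findall(r"\w+", post))
            padded = f" {lemmas} "  # pad so that only whole words match
            if any(f" {phrase} " in padded for phrase in phrases):
                kept.append(post)
        result[emotion] = kept
    return result


if __name__ == "__main__":
    demo_posts = ["I am so happy today :)", "Nice weather :)", "I miss you :("]
    stage_one = select_by_emoticons(demo_posts, EMOTICON_DICTS)
    print(select_by_phrases(stage_one, PHRASE_DICTS))
```

Each list in the result corresponds to one text file of the training dataset
for a particular class of emotional coloring.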


VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)                                                 212

   In turn, the set of dictionaries with copyright symbols for            dictionaries with key phrases are read, then the posts are
expressing emotions can be represented by formula (2)                     lemmatized and the key phrases are selected. Then the
                                                                          selected posts are saved in text files. After the formation of
DE = {DEjoy , DEsad , DEsurp , DEanger , DEdisg , DEcont , DEfear} (2)    the training dataset, training and testing of the accuracy of
where DEjoy – dictionary with emotion «joy», DEsad –                      determining the emotional coloring of posts by the neural
dictionary with emotion «sad», DEsurp – dictionary with                   network takes place.
emotion «surprise», DEanger – dictionary with emotion                         When building the software system, the following
«anger», DEdisg – dictionary with emotion «disgust», DEcont –             libraries were used.
dictionary with emotion «contempt», DEfear – dictionary with
emotion «fear».                                                               Lucene Russian Morphology is a library of
    In turn, the set of dictionaries with keywords can be                 morphological analysis [12]. This library performs a
represented by formula (3)                                                morphological analysis of the word. The library allows you
                                                                          to perform lemmatization of the source word in Russian and
DW = {DWjoy , DWsad , DWsurp , DWanger , DWdisg , DWcont , DWfear} (3)    get information about part of speech. Lucene uses
                                                                          dictionary-based morphology with some heuristics for unknown
where DWjoy – dictionary with emotion «joy», DWsad –                      words and supports homonyms.
dictionary with emotion «sad», DWsurp – dictionary with
emotion «surprise», DWanger – dictionary with emotion                         Encog Machine Learning Framework is a machine
«anger», DWdisg – dictionary with emotion «disgust», DWcont               learning library [13]. The library supports various learning
– dictionary with emotion «contempt», DWfear – dictionary                 algorithms. The main advantage of the library is the neural
with emotion «fear».                                                      network algorithms. The library contains classes for creating
    Each process of selecting posts for a specific emotion is             a wide range of networks and supports classes for
associated with a dictionary of the author's symbols for                  normalizing and processing data for these neural networks.
expressing emotions from DE and a dictionary of key                       Multithreading is used to provide optimal learning
phrases from DW.                                                          performance on multicore machines.

    The process of testing the training dataset can be                        PostgreSQL JDBC Driver is a library that provides
represented by a flowchart in Figure 2.                                   access to the PostgreSQL database [14]. The library provides
                                                                          a connection to the database and interaction with it. As
                                                                          parameters, the library accepts the database address and port,
                                  Start                                   login, and password for the connection. The library then
                                                                          receives SQL queries to the database as input and returns
                                                                          the data.
                Creation of the vectors of the text
                                                                                                  V. EXPERIMENTS
                                                                              We will evaluate the quality of the generated training
                        Neural network training                           dataset as the accuracy of determining the emotional coloring
                                                                          of the text by a neural network.
                                                                              For the experiments, the following parameters were
               Assessing the accuracy of sentiment                        chosen: a different number of posts in the training set and
                        analysis of a text                                two methods of text processing - stemming and
                                                                          lemmatization. The accuracy of the system was measured on
                                                                          test posts, each of which belongs to one category.

                                 Finish                                      The quality of the training dataset will be defined as the
                                                                          number of correct conclusions divided by the number of test
                                                                          posts. The experimental results are shown in Table 1.
Fig. 2. Learning set validation process.
                                                                                TABLE I.        STEMMING AND LEMMATIZATION EXPERIMENTS
     At the first stage, a set of vectors is formed using the
word2vec algorithm. Next, the neural network is trained.                            Number of posts    Stemming    Lemmatization
Finally, the accuracy of determining the emotional coloring                         20                 4/7         6/7
of the text is assessed using a test set.                                           50                 6/7         7/7

                                                                                    100                4/7         7/7
               IV. SOFTWARE IMPLEMENTATION                                          200                4/7         7/7
    To evaluate the effectiveness of the developed approach                         300                5/7         7/7
to the formation of the training dataset, a software system
was implemented.                                                              The experiments performed show that the training
                                                                          dataset formed with lemmatization is better than the one
    The system reads data from the database and the                       formed with stemming. Table 1 shows that the accuracy of
dictionaries of copyright symbols of expression of emotions               the recognition of posts by a neural network is much
and keywords for each emotion, performs lemmatization,                    higher when a training dataset is formed using the
forms a training dataset and trains the neural network.                   lemmatization method. The experimental
   First, dictionaries are read with copyright symbols for                results are also presented in the form of a graph in Figure 3.
expressing emotions, and then posts are selected. After that,

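The vector-creation stage of Fig. 2 and the layer sizes of the perceptron can
be sketched as follows. This is a minimal sketch that assumes the binary
presence encoding described in Section III (1 if a dictionary word occurs in
the post, otherwise 0); the function names are hypothetical, the posts are
assumed to be already lemmatized lists of words, and the divisor for the
hidden layer is left as a parameter rather than fixed.

```python
def build_vocabulary(lemmatized_posts):
    """Sorted list of all distinct words that occur in the training posts."""
    return sorted({word for post in lemmatized_posts for word in post})


def to_binary_vector(post, vocabulary):
    """One element per vocabulary word: 1.0 if it occurs in the post, else 0.0."""
    words = set(post)
    return [1.0 if word in words else 0.0 for word in vocabulary]


def layer_sizes(vocab_size, divisor, n_classes=7):
    """Sizes of the input, hidden and output layers.

    The hidden layer is the input size divided by a chosen divisor;
    the output layer has one neuron per emotion class.
    """
    return [vocab_size, max(1, vocab_size // divisor), n_classes]


if __name__ == "__main__":
    training_posts = [["good", "day"], ["bad", "day"]]
    vocabulary = build_vocabulary(training_posts)
    print(vocabulary)
    print(to_binary_vector(["good", "night"], vocabulary))
    print(layer_sizes(len(vocabulary), divisor=2))
```

Words of a test post that are absent from the vocabulary are ignored, which
matches transforming test posts with the dictionary obtained during training.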


                                                                          The training dataset is created in an automated mode using
                                                                          dictionaries of copyright symbols for expressing emotions
                                                                          and dictionaries of key phrases. The neural network correctly
                                                                          determines the class of emotional coloring of the post with
                                                                          an accuracy of 67%. The neural network recognizes
                                                                          emotions of joy, sadness and disgust with an accuracy of
                                                                          75%.
                                                                              In the future, it is planned to improve the training dataset
                                                                          generation algorithm. Compiled dictionaries will be
                                                                          expanded and updated. To test the set, neural networks of
                                                                          various architectures, for example, deep learning, will be
                                                                          used.
Fig. 3. Stemming and lemmatization.
                                                                                                  ACKNOWLEDGMENT
    Additionally, 1,400 posts were submitted to the neural
network, 200 posts from each class. The experimental results              This work was supported by the Russian Foundation for Basic
are presented in Table 2.                                                 Research, projects No. 18-47-730035 and 18-47-732007.

                   TABLE II.     EXPERIMENT RESULTS                                                     REFERENCES
                                                                          [1]  Yu.V. Vizilter, V.S. Gorbatsevich and S.Y. Zheltov, “Structure-
           Emotion          Total            +            -                    functional analysis and synthesis of deep convolutional neural
        Joy               200          148          52                         networks,” Computer Optics, vol. 43, no. 5, pp. 886-900, 2019. DOI:
                                                                               10.18287/2412-6179-2019-43-5-886-900.
        Sad               200          154          46                    [2] D.A. Grishelenok and A. A. Kovel, “Using the results of
        Anger             200          110          90                         mathematical planning of an experiment in the formation of a training
                                                                               dataset of a neural network: article,” Krasnoyarsk: SibSAU, 2010.
        Surprise          200          126          74                    [3] I.L. Kaftannikov and A.V. Parasich, “Problems of forming a training
        Fear              200          101          99                         dataset in machine learning problems,” Bulletin of SUSU. Series
                                                                               Computer technology, control, electronics, vol. 16, no. 3, pp. 15-24,
        Disgust           200          151          49                         2016.
        Contempt          200          121          79                    [4] R.V. Posevkin and I.A. Bessmertny, “The use of sentiment analysis of
                                                                               texts to assess public opinion,” Scientific and Technical Journal of
        Sum:              1400         936          464                        Information Technologies, Mechanics, and Optics, vol. 15, no. 1, pp.
                                                                               169-171, 2015.
        Percent:                       0.669        0.331
                                                                          [5] SentiFinder module [Online]. URL: eurekaengine.ru.
                                                                          [6] Thesaurus WordNet [Online]. URL: http://wndomains.fbk.eu/
    Experiments show that the neural network correctly                         wnaffect.html.
recognizes emotion with an accuracy of 67%. The neural                    [7] V. Moshkin, N. Yarushkina and I. Andreev, “The Sentiment Analysis
                                                                               of Unstructured Social Network Data Using the Extended Ontology
network performs best on joy, sadness and disgust, with an                     SentiWordNet,” IEEE 12th International Conference on
accuracy of about 75%. The results of the experiment are                       Developments in eSystems Engineering (DeSE), Kazan, Russia, pp.
also presented in the form of a graph in Figure 4.                             576-580, 2019. DOI: 10.1109/DeSE.2019.00110.
                                                                          [8] Thesaurus SentiWordNet [Online]. URL: http://sentiwordnet.isti.cnr.
                                                                               it.
                                                                          [9] SenticNet Thesaurus [Online]. URL: https://sentic.net.
                                                                          [10] Word2Vec Algorithm [Online]. URL: https://neurohive.io/ru/.
                                                                          [11] I.A. Rycarev, D.V. Kirsh and A.V. Kupriyanov, “Clustering of media
                                                                               content from social networks using BigData technology,” Computer
                                                                               Optics, vol. 42, no. 5, pp. 921-927, 2018. DOI: 10.18287/2412-6179-
                                                                               2018-42-5-921-927.
                                                                          [12] Library of morphological processing Russian Morphology: Russian
                                                                               [Online]. URL: https://github.com/AKuznetsov/russianmorphology.
                                                                          [13] Neural network library Encog Machine Learning Framework
                                                                               [Online]. URL: https://www.heatonresearch.com/encog/.
                                                                          [14] PostgreSQL JDBC Driver Database Access Library [Online]. URL:
Fig. 4. Experiment results.                                                    https://jdbc.postgresql.org/.

                         VI. CONCLUSION
    As a result of the work, an expert system was developed
to determine the emotional coloring of social network posts.
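
The quality measure used in the experiments, the number of correct conclusions
divided by the number of test posts, can be sketched as follows. The function
name is hypothetical. The per-class counts are the joy, sad and disgust rows
of Table II, and the overall value uses the Sum row of Table II as reported.

```python
def accuracy(correct, total):
    """Share of test posts whose emotion class was determined correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


# Correct answers out of 200 test posts per class (rows of Table II).
correct_by_class = {"joy": 148, "sad": 154, "disgust": 151}
per_class = {emotion: accuracy(n, 200) for emotion, n in correct_by_class.items()}

# Accuracy over the three best-recognized classes taken together.
best_three = accuracy(sum(correct_by_class.values()), 200 * len(correct_by_class))

# Overall accuracy from the Sum row of Table II as reported.
overall = accuracy(936, 1400)

if __name__ == "__main__":
    print(per_class, round(best_three, 3), round(overall, 3))
```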



</pre>