<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An approach to the training dataset formation for assessing the sentiment degree of social network posts using machine learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrey Konstantinov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ulyanovsk State Technical University</institution>
          , Ulyanovsk,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>211</fpage>
      <lpage>214</lpage>
      <abstract>
        <p>This article describes an approach to forming a training set for assessing the emotional coloring of social network posts. The dataset is formed in an automated mode. The input to the algorithm is 2.5 million posts from a social network; the output is a training set for a neural network. The algorithm selects posts using dictionaries of author's symbols for expressing emotions (such as emoticons) and dictionaries of key phrases. The quality of the training set is checked experimentally by training a multilayer perceptron on it. The neural network determines the emotional coloring of social network posts with an accuracy of about 67%.</p>
      </abstract>
      <kwd-group>
        <kwd>data analysis</kwd>
        <kwd>sentiment language processing</kwd>
        <kwd>social network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>The study of social networks is becoming increasingly
important every year due to the growing need to ensure
public safety and monitor public sentiment. An analysis of
posts can help assess changes in the mood of many users and
find application in political and social studies, including
consumer research.</p>
      <p>Currently, neural networks are used to solve various
problems in the field of intelligent data processing. The
deployment of a neural network is carried out in two stages.</p>
    </sec>
    <sec id="sec-2">
      <title>Choice of neural network architecture.</title>
    </sec>
    <sec id="sec-3">
      <title>Creation of a training dataset [1].</title>
      <p>Preparing the training dataset is time-consuming: in many
cases, an expert analyzes the data and assembles the training
dataset manually.</p>
      <p>The purpose of this work is to develop an experimental
model of a software system for determining the emotional
coloring of posts on a social network based on author's
symbols for expressing emotions.</p>
    </sec>
    <sec id="sec-4">
      <title>The main tasks are presented below.</title>
      <list list-type="bullet">
        <list-item><p>Analysis of the subject area, which includes the
determination of the source data for the formation
of the training dataset and of the classes of emotional
coloring of posts;</p></list-item>
        <list-item><p>A review of existing solutions and studies that were
proposed by Russian researchers;</p></list-item>
        <list-item><p>Development of a methodology for the formation of
a training dataset, which is based on the methods of
linguistic analysis of text information;</p></list-item>
      </list>
    </sec>
    <sec id="sec-5">
      <title>Software implementation;</title>
      <p></p>
      <p>Conducting experiments that show the effectiveness
of determining the emotional coloring of a post with
a trained neural network.</p>
      <p>
        In the course of the work, software systems and
modules that perform sentiment analysis of texts were also
considered. The SentiFinder module [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] distinguishes three types
of sentiment in Russian-language texts: positive, negative, and
neutral. Sentiment is determined relative to a given target object
within a single sentence or throughout a document. The
average accuracy for the three types of sentiment is about 87%.
      </p>
      <p>
        Some thesauri are annotated specifically with
emotional information. Such
dictionaries are needed by programs that
analyze the sentiment of a text. WordNet-Affect is a
semantic thesaurus in which concepts are associated with
emotions and are represented by words with an emotional
component [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. WordNet-Affect also uses additional
emotional labels to separate synsets according to their
emotional valency. To do this, four additional emotional
labels are defined: positive, negative, ambiguous, and
neutral.
      </p>
      <p>
        SentiWordNet is a lexical-semantic thesaurus, the first
version of which was developed in 2006 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. It is
the result of automatically annotating each set of
synonyms (synset) with its degree of positivity, negativity, and
objectivity. The later version of SentiWordNet provides more than a 20%
increase in accuracy compared to the first version [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        SenticNet is another semantic thesaurus for working with
sets of emotional concepts [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. SenticNet is used to design
intelligent applications for analyzing the emotional
component of the text. The main purpose of SenticNet is to
simplify the process of machine recognition of conceptual
and emotional information that is transmitted using natural
language. The main difference between the thesauri considered
is that SentiWordNet and WordNet-Affect
link words and emotional concepts at the
syntactic level and do not capture the semantic
component.
      </p>
      <p>The works reviewed describe only general
recommendations for forming a training dataset
and do not provide methods or algorithms for
forming a high-quality training dataset for sentiment
analysis in an automated mode. The knowledge accumulated
in these studies can nevertheless be used in
this work.</p>
    </sec>
    <sec id="sec-6">
      <title>III. MODELS AND ALGORITHMS</title>
      <p>The most popular method for creating a training dataset
is selection by keywords and phrases. This
method uses dictionaries of author's symbols for expressing
emotions and dictionaries of key phrases.</p>
      <p>The dictionaries of author's symbols for expressing
emotions were compiled by an expert. Each dictionary is
compiled for a specific emotion and contains several
author's symbols for expressing that emotion. The dictionaries of
key phrases were found on the Internet and supplemented by
analyzing posts on the social network.</p>
      <p>At the first stage, posts are selected based on the dictionaries
of author's symbols for expressing emotions. The input
is 2.5 million posts taken from the database. If
a post contains an author's symbol for expressing an emotion,
it is assigned to the corresponding class and added to the
corresponding list.</p>
      <p>In the second stage, posts are selected based on
dictionaries of key phrases. The input is the lists
obtained at the previous stage. At this stage,
each word of a post is lemmatized. Then the post
is checked for each key phrase from the dictionary.
If the post contains a phrase, it is assigned to the corresponding
class of emotional coloring. At the output, the data is written
to text files, each of which contains the training dataset for a
particular class of emotional coloring.</p>
      <p>Fig. 1. Post selection process.</p>
      <p>
        At the first stage, posts are selected based on the dictionaries
of author's symbols for expressing emotions for each
class of emotional coloring. In the second stage,
posts are selected based on the dictionaries of key phrases. In the
third stage, only posts shorter than a
specified length are kept. The length restriction was introduced because
training the neural network on long posts reduces the
accuracy with which the emotional coloring of a text is recognized
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
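      <p>The three selection stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the implemented system: the function names, the two miniature dictionaries, the whitespace-based stand-in for lemmatization, and the choice to measure post length in words are all hypothetical.</p>
      <preformat>
```python
from typing import Dict, List

# Hypothetical miniature dictionaries, for illustration only.
SYMBOLS: Dict[str, List[str]] = {
    "joy": [":)", ":D"],
    "sadness": [":("],
}
KEY_PHRASES: Dict[str, List[str]] = {
    "joy": ["happy", "great day"],
    "sadness": ["miss you", "sad"],
}


def lemmatize(text: str) -> List[str]:
    """Stand-in for a real lemmatizer: lowercases and strips punctuation."""
    words = [word.strip(".,!?") for word in text.lower().split()]
    return [word for word in words if word]


def select_posts(
    posts: List[str],
    symbols: Dict[str, List[str]],
    key_phrases: Dict[str, List[str]],
    max_words: int,
) -> Dict[str, List[str]]:
    """Return, for each emotion, the posts that pass all three stages."""
    selected: Dict[str, List[str]] = {emotion: [] for emotion in symbols}
    for post in posts:
        lemmas = lemmatize(post)
        padded = " " + " ".join(lemmas) + " "
        for emotion, marks in symbols.items():
            # Stage 1: the post contains a symbol from the emotion's dictionary.
            if not any(mark in post for mark in marks):
                continue
            # Stage 2: the lemmatized post contains a key phrase of that emotion.
            phrases = key_phrases.get(emotion, [])
            if not any(" " + phrase + " " in padded for phrase in phrases):
                continue
            # Stage 3: only posts shorter than the length limit are kept.
            if len(lemmas) >= max_words:
                continue
            selected[emotion].append(post)
    return selected
```
      </preformat>
      <p>In this sketch a post must pass all three checks for an emotion before it is added to that emotion's list, so a post that contains a symbol for "joy" but no key phrase for "joy" is discarded.</p>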
      <p>Formally, the set of dictionaries by which posts are selected
can be represented by formula (1):
<disp-formula><label>(1)</label><tex-math>D = \{\mathit{DE}, \mathit{DW}\}</tex-math></disp-formula>
where DE is the set of dictionaries of author's symbols for
expressing emotions and DW is the set of dictionaries of keywords
and phrases.</p>
      <p>
        A neural network works only with vectors, so texts must
be represented in vector form. To represent the training
dataset in the form of vectors, the word2vec algorithm was
used [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. First, all words are reduced to their initial
form using lemmatization, and a list of all the words in the posts is
compiled. Then, for each post, a vector is created whose
size is equal to the size of this list. An element of the
vector is set to 1 if the corresponding word occurs in the post and to 0
otherwise.
      </p>
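      <p>The encoding described in the preceding paragraph, a vector over the list of all words with 1 where a word occurs in the post and 0 elsewhere, can be sketched as below. The sketch follows that textual description literally (a binary presence vector) and is not an implementation of word2vec embeddings; the function names are hypothetical, and the input is assumed to be already lemmatized.</p>
      <preformat>
```python
from typing import Dict, List


def build_vocabulary(lemmatized_posts: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct lemma a fixed position in the vector."""
    vocabulary: Dict[str, int] = {}
    for lemmas in lemmatized_posts:
        for lemma in lemmas:
            if lemma not in vocabulary:
                vocabulary[lemma] = len(vocabulary)
    return vocabulary


def vectorize(lemmas: List[str], vocabulary: Dict[str, int]) -> List[float]:
    """Return a binary vector: 1.0 where a vocabulary word occurs in the post."""
    vector = [0.0] * len(vocabulary)
    for lemma in lemmas:
        index = vocabulary.get(lemma)
        if index is not None:  # words absent from the vocabulary are ignored
            vector[index] = 1.0
    return vector
```
      </preformat>
      <p>The same vocabulary is reused for test posts, so a word that did not occur in the training posts simply contributes nothing to the vector.</p>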
      <p>A multilayer perceptron with three layers was used as a
neural network. The number of neurons in the first layer is
equal to the size of the list of all dictionary words. The
number of neurons in the second layer is equal to the size of
the first divided by 50. The size of the second layer was
selected by conducting many experiments. For a dictionary
of 2000 words, the size of the second layer will be 40
neurons. The number of neurons of the third layer is equal to
seven since we need to determine seven emotions.</p>
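      <p>The layer sizing rule can be written out as a short calculation. This is a sketch under stated assumptions: the function name is hypothetical, the divisor is left as a parameter so that other ratios can be tried, and the output layer is taken to have one neuron per emotion class.</p>
      <preformat>
```python
from typing import Tuple

# The seven emotion classes used for the dictionaries.
EMOTIONS = ("joy", "sadness", "surprise", "anger", "disgust", "contempt", "fear")


def layer_sizes(vocabulary_size: int, divisor: int = 50) -> Tuple[int, int, int]:
    """Return (input, hidden, output) neuron counts for the perceptron."""
    hidden = max(1, vocabulary_size // divisor)  # at least one hidden neuron
    return vocabulary_size, hidden, len(EMOTIONS)
```
      </preformat>
      <p>With the divisor of 50, a 2000-word vocabulary gives layer sizes (2000, 40, 7); a divisor of 5 gives 400 hidden neurons.</p>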
      <p>After training the neural network, a test set is input. Each
post of the set is also transformed into a vector based on the
dictionary that was obtained during the training of the neural
network.</p>
      <p>A. Formal Description of the System</p>
      <p>Formally, the process of selecting posts can be
represented by a flowchart in Figure 1. The flowchart
describes the process of selecting posts for the formation of a
training dataset. Each stage of the selection contains the
processes of selecting posts for each specific emotion.</p>
    </sec>
    <sec id="sec-7">
      <p>[Flowchart of Fig. 1: Start; selection of posts based on dictionaries of author's symbols for expressing emotions; selection of posts based on dictionaries of key phrases; End.]</p>
    </sec>
    <sec id="sec-9">
      <p>In turn, the set of dictionaries of author's symbols for
expressing emotions can be represented by formula (2):
<disp-formula><label>(2)</label><tex-math>\mathit{DE} = \{\mathit{DE}_{joy}, \mathit{DE}_{sad}, \mathit{DE}_{surp}, \mathit{DE}_{anger}, \mathit{DE}_{disg}, \mathit{DE}_{cont}, \mathit{DE}_{fear}\}</tex-math></disp-formula>
where DE<sub>joy</sub> is the dictionary for the emotion "joy", DE<sub>sad</sub> is the
dictionary for "sadness", DE<sub>surp</sub> is the dictionary for
"surprise", DE<sub>anger</sub> is the dictionary for
"anger", DE<sub>disg</sub> is the dictionary for "disgust", DE<sub>cont</sub> is the
dictionary for "contempt", and DE<sub>fear</sub> is the dictionary for
"fear".</p>
      <p>Likewise, the set of dictionaries of keywords can be
represented by formula (3):
<disp-formula><label>(3)</label><tex-math>\mathit{DW} = \{\mathit{DW}_{joy}, \mathit{DW}_{sad}, \mathit{DW}_{surp}, \mathit{DW}_{anger}, \mathit{DW}_{disg}, \mathit{DW}_{cont}, \mathit{DW}_{fear}\}</tex-math></disp-formula>
where DW<sub>joy</sub> is the dictionary for the emotion "joy", DW<sub>sad</sub> is the
dictionary for "sadness", DW<sub>surp</sub> is the dictionary for
"surprise", DW<sub>anger</sub> is the dictionary for
"anger", DW<sub>disg</sub> is the dictionary for "disgust", DW<sub>cont</sub>
is the dictionary for "contempt", and DW<sub>fear</sub> is the dictionary
for "fear".</p>
      <p>Each process of selecting posts for a specific emotion is
associated with one dictionary of author's symbols for
expressing emotions from DE and one dictionary of key
phrases from DW.</p>
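      <p>One possible way to hold the dictionary sets in memory is to pair, for every emotion, its symbol dictionary with its key-phrase dictionary. The class and function names below are hypothetical and serve only to illustrate the structure; the contents of the actual dictionaries are not reproduced.</p>
      <preformat>
```python
from dataclasses import dataclass, field
from typing import Dict, List

EMOTIONS = ["joy", "sadness", "surprise", "anger", "disgust", "contempt", "fear"]


@dataclass
class EmotionDictionaries:
    """The pair of dictionaries associated with one emotion."""
    symbols: List[str] = field(default_factory=list)      # one element of DE
    key_phrases: List[str] = field(default_factory=list)  # one element of DW


def empty_dictionary_set() -> Dict[str, EmotionDictionaries]:
    """Build D as a mapping from each emotion to its own pair of dictionaries."""
    return {emotion: EmotionDictionaries() for emotion in EMOTIONS}
```
      </preformat>
      <p>Keeping both dictionaries under one emotion key makes the per-emotion selection process a simple lookup, at the cost of departing slightly from the two separate sets DE and DW used in the formulas.</p>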
      <p>The process of testing the training dataset can be
represented by a flowchart in Figure 2.</p>
    </sec>
    <sec id="sec-10">
      <p>[Flowchart of Fig. 2: Start; creation of the vectors of the text; neural network training; assessing the accuracy of sentiment analysis of a text; Finish.]</p>
    </sec>
    <sec id="sec-13">
      <p>At the first stage, a set of vectors is formed using the
word2vec algorithm. Next, the neural
network is trained. Finally, the accuracy of
determining the emotional coloring of the text is assessed on a test
set.</p>
    </sec>
    <sec id="sec-14">
      <title>IV. SOFTWARE IMPLEMENTATION</title>
      <p>To evaluate the effectiveness of the developed approach to the formation of the training dataset, a software system was implemented.</p>
      <p>The system reads posts from the database, reads the dictionaries
of author's symbols for expressing emotions and of
key phrases for each emotion, lemmatizes the posts, forms
the training dataset, and trains the neural network.</p>
      <p>First, the dictionaries of author's symbols for
expressing emotions are read, and posts are selected with them. After that,
the dictionaries of key phrases are read, the posts are
lemmatized, and posts containing key phrases are selected. The
selected posts are then saved in text files. After the
training dataset has been formed, the neural network is trained and
the accuracy with which it determines the emotional coloring of posts
is tested.</p>
      <p>When building the software system, the following
libraries were used.</p>
      <p>Lucene Russian Morphology is a library for
morphological analysis [12]. It performs
morphological analysis of a word, which makes it possible
to lemmatize a source word in Russian and
to obtain information about its part of speech. The library uses
dictionary-based morphology with some heuristics for unknown words
and supports homonyms.</p>
      <p>Encog Machine Learning Framework is a machine
learning library [13]. The library supports various learning
algorithms. The main advantage of the library is the neural
network algorithms. The library contains classes for creating
a wide range of networks and supports classes for
normalizing and processing data for these neural networks.
Multithreading is used to provide optimal learning
performance on multicore machines.</p>
      <p>PostgreSQL JDBC Driver is a library that provides
access to a PostgreSQL database [14]. The library establishes
a connection to the database and handles interaction with it. As
parameters, the library accepts the database address and port and the
login and password for the connection. The library then
passes SQL queries to the database and returns the
resulting data.</p>
    </sec>
    <sec id="sec-15">
      <title>V. EXPERIMENTS</title>
      <p>We will evaluate the quality of the generated training
dataset as the accuracy of determining the emotional coloring
of the text by a neural network.</p>
      <p>Two parameters were varied in the experiments:
the number of posts in the training set and
the method of text processing (stemming or
lemmatization). The accuracy of the system was measured on
test posts, each of which belongs to exactly one category.</p>
      <p>The quality of the training dataset will be defined as the
number of correct conclusions divided by the number of test
posts. The experimental results are shown in Table 1.</p>
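      <p>The quality measure defined above, the number of correct conclusions divided by the number of test posts, can be computed as in the following sketch. The function names are hypothetical; the per-class variant is added only as an assumption about how class-level figures such as those for individual emotions could be obtained.</p>
      <preformat>
```python
from collections import Counter
from typing import Dict, List


def accuracy(expected: List[str], predicted: List[str]) -> float:
    """Share of test posts whose predicted class equals the expected class."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for e, p in zip(expected, predicted) if e == p)
    return correct / len(expected)


def per_class_accuracy(expected: List[str], predicted: List[str]) -> Dict[str, float]:
    """Accuracy computed separately over the test posts of each expected class."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("label lists must be non-empty and of equal length")
    totals = Counter(expected)
    hits = Counter(e for e, p in zip(expected, predicted) if e == p)
    return {label: hits[label] / totals[label] for label in totals}
```
      </preformat>
      <p>For example, with expected classes joy, joy, fear, fear and predicted classes joy, fear, fear, fear, the overall accuracy is 0.75, while the per-class values are 0.5 for joy and 1.0 for fear.</p>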
      <p>The experiments show that a training
dataset formed with lemmatization is
of higher quality than one formed with stemming. Table 1
shows that the accuracy with which the
neural network recognizes posts is much higher when the training dataset is
formed using lemmatization. The experimental
results are also presented in the form of a graph in Figure 3.</p>
      <p>Additionally, 1,400 posts were submitted to the neural
network, 200 posts from each class. The experimental results
are presented in Table 2.</p>
      <p>Experiments show that the neural network correctly
recognizes emotion with an accuracy of 67%. The
neural network is best at determining joy, sadness, and disgust, with an
accuracy of about 75%. The results of the experiment are
also presented in the form of a graph in Figure 5.</p>
      <p>Fig. 4. Experiment results.</p>
      <p>VI. CONCLUSION</p>
      <p>As a result of this work, an expert system was developed
to determine the emotional coloring of social network posts.
The training dataset is created in an automated mode using
dictionaries of author's symbols for expressing emotions
and dictionaries of key phrases. The neural network correctly
determines the class of emotional coloring of a post with
an accuracy of 67%. The neural network recognizes the
emotions of joy, sadness, and disgust with an accuracy of about
75%.</p>
      <p>In the future, we plan to improve the training dataset
generation algorithm. The compiled dictionaries will be
expanded and updated. To test the set, neural networks of
various architectures, for example, deep learning networks, will be
used.</p>
    </sec>
    <sec id="sec-16">
      <title>ACKNOWLEDGMENT</title>
      <p>This work was supported by the Russian Foundation for Basic
Research, projects No. 18-47-730035 and 18-47-732007.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Yu.V.</given-names>
            <surname>Vizilter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.S.</given-names>
            <surname>Gorbatsevich</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.Y.</given-names>
            <surname>Zheltov</surname>
          </string-name>
          , “
          <article-title>Structure-functional analysis and synthesis of deep convolutional neural networks</article-title>
          ,”
          <source>Computer Optics</source>
          , vol.
          <volume>43</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>886</fpage>
          -
          <lpage>900</lpage>
          ,
          <year>2019</year>
          . DOI: 10.18287/2412-6179-2019-43-5-886-900.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.A.</given-names>
            <surname>Grishelenok</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Kovel</surname>
          </string-name>
          , “
          <article-title>Using the results of mathematical planning of an experiment in the formation of a training dataset of a neural network: article</article-title>
          ,” Krasnoyarsk: SibSAU,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.L.</given-names>
            <surname>Kaftannikov</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.V.</given-names>
            <surname>Parasich</surname>
          </string-name>
          , “
          <article-title>Problems of forming a training dataset in machine learning problems</article-title>
          ,” Bulletin of SUSU. Series Computer technology, control, electronics, vol.
          <volume>16</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>24</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.V.</given-names>
            <surname>Posevkin</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.A.</given-names>
            <surname>Bessmertny</surname>
          </string-name>
          , “
          <article-title>The use of sentiment analysis of texts to assess public opinion</article-title>
          ,”
          <source>Scientific and Technical Journal of Information Technologies, Mechanics, and Optics</source>
          , vol.
          <volume>15</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>171</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] SentiFinder module [Online]. URL: eurekaengine.ru.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] Thesaurus WordNet-Affect [Online]. URL: wnaffect.html.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Moshkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yarushkina</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Andreev</surname>
          </string-name>
          , “
          <article-title>The Sentiment Analysis of Unstructured Social Network Data Using the Extended Ontology SentiWordNet</article-title>
          ,” IEEE 12th International Conference on Developments in eSystems Engineering (DeSE), Kazan, Russia, pp.
          <fpage>576</fpage>
          -
          <lpage>580</lpage>
          ,
          <year>2019</year>
          . DOI: 10.1109/DeSE.2019.00110.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] Thesaurus SentiWordNet [Online]. URL: http://sentiwordnet.isti.cnr.it.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] SenticNet Thesaurus [Online]. URL: https://sentic.net.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] Word2Vec Algorithm [Online]. URL: https://neurohive.io/ru/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>I.A.</given-names>
            <surname>Rycarev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.V.</given-names>
            <surname>Kirsh</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.V.</given-names>
            <surname>Kupriyanov</surname>
          </string-name>
          , “
          <article-title>Clustering of media content from social networks using BigData technology</article-title>
          ,”
          <source>Computer Optics</source>
          , vol.
          <volume>42</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>921</fpage>
          -
          <lpage>927</lpage>
          ,
          <year>2018</year>
          . DOI: 10.18287/2412-6179-2018-42-5-921-927.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>