<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Building a graph of a sequence of text units to create a sentence generation system</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maksim Kaminskiy</string-name>
          <email>beefiestracer@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Igor Rytsarev</string-name>
          <email>rycarev@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maximilian Khotilin</string-name>
          <email>turbomax.1994@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Kupriyanov</string-name>
          <email>alexkupr@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Samara National Research University</institution>
          ,
          <addr-line>Samara</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Samara National Research University; Image Processing Systems Institute of RAS - Branch of the FSRC "Crystallography and Photonics" RAS</institution>
          ,
          <addr-line>Samara</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>121</fpage>
      <lpage>126</lpage>
      <abstract>
        <p>The article is devoted to the development of a text data analysis system. Approaches to representing the text of the posts of a single page as a dictionary of phrases for sentence generation, and the application of the developed system for correcting the results of neural-network generation, are considered. Within the framework of the work, data collection, filtering and processing using Big Data technologies were implemented.</p>
      </abstract>
      <kwd-group>
        <kwd>annotation</kwd>
        <kwd>social networks</kwd>
        <kwd>big data</kwd>
        <kwd>graph</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>The 'social network' notion was used by sociologists as far back as the 1920s for investigating the interrelations between participants of different communities. The psychologist Jacob Moreno proposed sociograms: graphs in which separate individuals are represented by points, and the interrelations between them by lines. The idea of using the apparatus of graph theory for studying interrelations between people was taken up by specialists in such areas as sociology, psychology, anthropology, political science and economics; thus the field of Social Network Analysis was established, dealing with the structural properties of social interrelations modeled in the form of graphs and networks. Building the model based on various data from printed media, additional inquiries and questionnaires was an important but rather time-consuming stage of such investigations [1].</p>
      <p>
        Contemporary social networks have substantially eased the work of researchers by presenting them with a growing and easily accessible source of big data. Every day, the users of social networks generate large volumes of data of different types. The results of analyzing this information can become excellent material for investigations in various fields [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For example, Social Media Marketing (SMM) is an important tool of Internet promotion for many companies. Social networks are an environment in which all users unconsciously work as focus groups and do not hesitate to share their opinions, argue, prove their case, and express their needs and wishes. Companies are constantly looking for the client insights that people share on social networks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. One of the tools for these studies is content analysis, a text analysis method carried out by counting the occurrence of components in the analyzed information, used in sociology as well as in computer technology. The purpose of this method is to identify or measure various facts and trends reflected in the investigated documents. Using content analysis, it is possible to establish both the characteristics of information sources and the characteristics of the communication process. Content analysis can be applied to most documentary sources, but it works best with a relatively large amount of uniform data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Hence, it is vital to be able to represent these data in a form convenient for efficient analysis.
      </p>
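      <p>As a minimal illustration of the counting at the heart of content analysis, the occurrence of components (here, words) across a set of posts might be tallied as follows; this is a hypothetical sketch, not part of the system described later.</p>

```python
from collections import Counter

def term_frequencies(posts):
    """Count how often each word occurs across a set of posts:
    a minimal sketch of content analysis by occurrence counting."""
    counts = Counter()
    for post in posts:
        counts.update(post.lower().split())
    return counts
```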
      <p>
        From a commercial point of view, the most successful
Natural-language generation (NLG) applications have been
data-to-text conversion systems that generate text
summaries of databases and datasets. These systems usually
perform data analysis as well as text generation. Research
has shown that text-based summaries can be more effective
than graphics and other visual elements for decision support,
and that computer-generated texts can outperform (from the
reader's point of view) human-written texts. There is
currently considerable commercial interest in using NLG to
aggregate financial and business data. Gartner has said that
NLG will become the standard tool for 90% of modern BI
(Business intelligence) and analytics platforms. NLG is also
used for commercial purposes for automated journalism,
chatbots, creating product descriptions for e-Commerce
sites, and compiling brief medical records [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Text annotation methods can be broken down into
two groups: extractive and generative. Among the
extractive methods of automatic annotation, the method on
the basis of graph theory can be distinguished, in which the
text is represented as a graph whose nodes are text
fragments and whose edges are the relations among them [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>II. TASK SETTINGS</title>
      <p>The modern world is dynamic and computerized; employees are required to complete tasks as quickly and well as possible. Software using the developed algorithm can be applied by employees whose occupations involve typing similar-in-content documents, reducing the time spent on such tasks, or in organizations that employ citizens with disabilities (static and dynamic disorders of the upper limbs, visual impairments) in quota-based positions whose duties are directly related to working with computers. The software tool can also be used in the field of education, giving students the opportunity to save time when reporting on completed work. With its help, it is also possible to facilitate blogging on social networks for professional, entertainment or educational purposes, since the algorithm learns the style of the written texts and begins to suggest the most suitable words for input.</p>
    </sec>
    <sec id="sec-3">
      <title>III. COLLECTION AND WORKING WITH DATA</title>
      <p>The algorithm developed within the framework of the research first of all collects data, then filters it in order to obtain the crucial text information, and then builds a graph of key words in which chains of words are formed as the text is traversed. Further on, if required, the system can be additionally improved by adding new texts belonging to other authors, for style combining [7].</p>
      <p>One of the most known weblog platforms LiveJournal
was chosen as a source of data, which represents the
possibility of publishing own records and commenting on
others. This large resource abounds with weblogs on various
topics, being an excellent source of large volumes of text
information. All obtained information is stored in the text file
to work with after that.</p>
      <p>The data must be prepared for further work with the text. Hyperlinks, emojis, punctuation marks and special characters are filtered out, and all remaining letters are converted to lowercase. Words shorter than four characters are filtered out as well, in order to exclude the majority of auxiliary parts of speech. After that, the text is structured into separate key words. Lemmatization of the tokens, i.e. reducing the words to their initial form, is then performed.</p>
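      <p>The filtering steps above might be sketched as follows; this is a minimal illustration under assumed regular expressions, and the lemmatization step is omitted.</p>

```python
import re

def preprocess(text):
    """Prepare raw post text: drop hyperlinks, punctuation and
    special characters, lowercase, and keep only words of four
    or more characters (a sketch of the described filtering)."""
    text = re.sub(r"https?://\S+", " ", text)   # hyperlinks
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation, emojis
    tokens = text.lower().split()
    # Words shorter than four characters are mostly auxiliary
    # parts of speech, so they are discarded.
    return [t for t in tokens if len(t) >= 4]
```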
    </sec>
    <sec id="sec-6">
      <p>Under lemmatization, the parts of speech are transformed according to the following scheme: nouns to the singular, nominative case; adjectives to the singular, masculine, nominative case; verbs to the indefinite form (infinitive). An example of lemmatization can be seen in Figure 1.</p>
      <p>From the vocabulary of key words w1, w2, w3, …, wn, a graph can be built. The nodes of the graph are the key words wi from the vocabulary, and the edges connect them into phrases from the text. The number of repetitions among the words is given as the edge weight.</p>
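      <p>One illustrative way to store such a graph is a nested mapping from word to neighbor to repetition count; this is an assumed data structure, not necessarily the authors' implementation.</p>

```python
from collections import defaultdict

def build_graph(tokens):
    """Nodes are key words; an edge links consecutive words of a
    phrase; the edge weight counts how often the pair repeats."""
    graph = defaultdict(lambda: defaultdict(int))
    for a, b in zip(tokens, tokens[1:]):
        graph[a][b] += 1
    return graph
```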
      <p>When improving the system, a new portion of processed information is introduced into the graph. However, since the weight of a new bond will initially be less than that of the bonds already existing in the graph, for compensation purposes a new structure is introduced at every node, represented as a stack of words S = (s1, s2, s3, …, sm), where si is a word taken separately from the stack. It holds the latest bonds created after the node. Priority for output is given to the new data, and the low weight of the bond is compensated by introducing a coefficient k that depends on the position of the word in the stack. By selecting this coefficient, logic chains can be built for two sets of data at once, with the second set having some priority because it was used for improving the system.</p>
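      <p>The text does not give the exact form of the coefficient, so the position-dependent boost below is purely an assumption, shown only to make the compensation idea concrete.</p>

```python
def compensated_weight(edge_weight, stack, word, k=0.5):
    """Boost the weight of a bond whose word sits in the node's
    stack of recent bonds; newer positions (closer to the top of
    the stack, index 0) get a larger boost. The linear form of
    the coefficient k is an assumption."""
    if word in stack:
        position = stack.index(word)  # 0 = most recently added
        return edge_weight + k * (len(stack) - position)
    return edge_weight
```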
      <p>The summarized scheme of work of the described algorithm is given on Figure 2.</p>
      <p>Even among the 49 coinciding words, a big difference between the neighboring nodes can be observed, leading to the conclusion that the frequency of the coinciding words differs between the authors.</p>
      <p>Further on, we consider the total capacity of every node of the graph. As seen from Figures 5 and 6, the first author frequently uses certain words (for example, variations of the word «быть» ("to be")), while the other author's usage of words is more even.</p>
      <p>As a result of these comparisons, it can be concluded that the authors' lexicons differ substantially, even though they write articles on similar topics.</p>
      <p>
        The developed algorithm was also applied to texts generated by GRU and LSTM neural networks in order to eliminate word errors and increase contextual connectivity. As a dataset for training the text-generation neural networks, a text consisting of the speeches of the characters of Shakespeare's plays was taken. To check the generated texts, GLTR (Giant Language model Test Room) was selected, a tool for detecting automatically generated text. This instrument can take any text data and analyze what the language model GPT-2 would predict at each position. Each text is analyzed according to how likely each word is to be the predicted word, taking into account the context on the left. If the actual word used is among the top 10 predicted words, its background is colored green; among the top 100, yellow; among the top 1000, red; otherwise, purple. Figures 7 and 8 show the results of the analysis of the texts, and Figures 9 and 10 show histograms in which the number of predictions for each of the texts is calculated [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
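      <p>The coloring rule described above amounts to bucketing the rank of the actual word among the model's predictions, which can be sketched as follows.</p>

```python
def gltr_color(rank):
    """GLTR-style bucket for the 1-based rank of the actual word
    in the language model's prediction list."""
    if rank > 1000:
        return "purple"
    if rank > 100:
        return "red"
    if rank > 10:
        return "yellow"
    return "green"
```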
      <p>As can be seen from the generation results, the text generated by the GRU turned out to be not very contextually connected, and 7 words were displayed with errors; the LSTM-generated text contains slightly more word errors (9), but according to the data from the histograms it surpassed the previous neural network.</p>
      <p>To correct the received texts, an algorithm was developed that, in conjunction with the constructed graph, makes it possible to correct errors in words and increase contextual connectivity.</p>
      <p>To build the graph, the text on which the neural networks were trained was used (Figure 11).</p>
      <p>Then triples of words h_i, h_(i+1) and h_(i+2) are examined. Since we are working with a large text data set, "windows" of three words considered during the operation of the algorithm are enough to correct the text. Each word h_i in the sentence is checked for its presence in the graph, and then the words associated with it are considered; if the word is not in the graph, we shift our "window" by 1 step. Then we check for the presence of the related word h_(i+1). If h_(i+1) is in the list, we shift the "window" by 1 step and continue checking. If not, we look at the word h_(i+2). We check the presence of this word in the graph; if it is absent, we shift the "window" by 3 steps. If the word is found, we check whether a connection with h_(i+2) can be established through the words to which the links from h_i depart. If there are connecting words, we put one of them in place of h_(i+1); if not, we shift the "window" by 2 steps. The choice of the word to be put in place of h_(i+1) is carried out by calculating the probability by the formula
p_j = w_j / (w_1 + w_2 + … + w_n),   (1)
where w_j is the weight of the edge between h_(i+1) and the word connecting it to h_i, p_j is the probability of choice, and n is the number of connected words. A generalized scheme of the described algorithm is presented on Figure 12.</p>
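      <p>Assuming the graph is stored as a word-to-neighbors mapping with edge weights, the sliding-window correction and formula (1) might be sketched as follows; the exact attachment of the weights w_j is ambiguous in the text, so the weighting used here is an interpretation.</p>

```python
import random

def correct(words, graph):
    """Slide a three-word window over a generated sentence and
    replace the middle word when the key-word graph can bridge
    its neighbours (a sketch; graph maps word to {neighbour: weight})."""
    words = list(words)
    i = 0
    while len(words) > i + 2:
        h0, h1, h2 = words[i], words[i + 1], words[i + 2]
        if h0 not in graph:            # word unknown: shift by 1
            i += 1
        elif h1 in graph[h0]:          # chain intact: shift by 1
            i += 1
        elif h2 not in graph:          # nothing to bridge: shift by 3
            i += 3
        else:
            # Words reachable from h0 that themselves link to h2.
            bridges = [w for w in graph[h0] if h2 in graph.get(w, {})]
            if not bridges:
                i += 2
            else:
                # Formula (1): pick a bridge with probability
                # proportional to its edge weight from h0.
                weights = [graph[h0][w] for w in bridges]
                words[i + 1] = random.choices(bridges, weights=weights)[0]
                i += 1
    return words
```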
      <p>An algorithm was also developed that, when used in
conjunction with a graph of keywords, can be applied to texts
generated by neural networks to correct incorrectly generated
words and increase contextual connectivity. In addition,
this algorithm improved the results of the GLTR analysis
of the text.</p>
      <p>This application has an extensive scope. For example, given the unfavorable epidemic situation caused by the coronavirus infection, the majority of the population currently needs to master remote work. At the same time, the specificity of each sphere implies a certain terminology and a set of the most used turns of speech when creating product descriptions and corresponding with customers and partners. This program will simplify and accelerate the typing of textual material, which in turn will increase the productivity of the labor process and help to deal more effectively with deadlines (which, by the way, have already become the norm of modern life).</p>
      <p>Also, this program can assist in the preparation of advertising articles and political campaign materials. It allows analyzing large textual volumes (for example, articles on the Internet or in print media) to determine the intentions and psychological state of target groups, and to identify attitudes, interests, values and belief systems by highlighting the most commonly used expressions and turns of speech. Subsequently, by relying on these stable constructions and using them in composing his own texts, the author acts between the lines on the reader's unconscious mind, letting him know that they speak the same language and that their problems and ideals are the same, thereby increasing the reader's openness to the information presented and trust in it.</p>
      <p>But there is also a category of people who find it difficult to type texts on a computer keyboard due to limited health capabilities: for example, a person with spastic disorders of the upper extremities who works on a PC. Each movement is much more difficult for him and requires greater effort than for a conventionally healthy person, and, in addition, exhaustion and fatigue come much faster. Here, the use of the application will act as a significant assistant, minimizing voluntary movements and, therefore, the energy expended.</p>
    </sec>
    <sec id="sec-7">
      <title>Thus, the first algorithm presented:</title>
      <p>Simplifies typing, as it learns the author's writing style and suggests the most appropriate words for subsequent input;</p>
      <p>Saves time, because instead of manual typing the user can choose among the word options displayed by the algorithm, which partially automates the process of working with text;</p>
      <p>Increases productivity by reducing time costs, making it possible to do more work in the same amount of time.</p>
      <p>As a result, the totality of these advantages increases the efficiency of working with text.</p>
    </sec>
    <sec id="sec-8">
      <title>And the second algorithm:</title>
      <p></p>
      <p>Fixes errors in incorrectly generated
replacing them;
words by
 Increases the contextual coherence of the text by
replacing words with those that have associations
within the graph, and which are met in the original
text, written by a human.</p>
      <p>This transformation brings the text closer to the author's style, making it look less like text compiled by a machine.</p>
    </sec>
    <sec id="sec-9">
      <title>ACKNOWLEDGMENT</title>
      <p>This work was carried out with financial support from the Russian Foundation for Basic Research (projects No. 18-37-00418, No. 19-29-01135, No. 19-31-90160) and the Ministry of Science and Higher Education of the Russian Federation (grant # 0777-2020-0017) within the framework of the governmental task of Samara University and the FSRC "Crystallography and Photonics" of RAS.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Blake</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Saleh</surname>
          </string-name>
          , “
          <article-title>Social-Network-Sourced Big Data Analytics</article-title>
          ,”
          <source>Open Systems. DBMS</source>
          , no.
          <issue>8</issue>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>41</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.A</given-names>
            <surname>Rytsarev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.V.</given-names>
            <surname>Kirish</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.V.</given-names>
            <surname>Kupriyanov</surname>
          </string-name>
          , “
          <article-title>Clustering of media content from social networks using bigdata technology</article-title>
          ,
          <source>” Computer Optics</source>
          , vol.
          <volume>42</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>921</fpage>
          -
          <lpage>927</lpage>
          ,
          <year>2018</year>
          . DOI: 10.18287/2412-6179-2018-42-5-921-927.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] “
          <article-title>Social network analytics: 10 ways to use monitoring systems</article-title>
          ,”
          <source>YouScan - Social Media Monitoring System</source>
          ,
          <year>2019</year>
          [Online]. URL: https://youscan.io/ru/blog/10-instrumentov-analiza-socsetei/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.V.</given-names>
            <surname>Dmitriev</surname>
          </string-name>
          , “
          <article-title>Content analysis: essence, tasks, procedures</article-title>
          ,” PSYFACTOR - Center for Practical Psychology,
          <year>2005</year>
          [Online]. URL: https://psyfactor.org/lib/k-a.htm.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] “
          <article-title>Natural-language generation</article-title>
          ,” Wikipedia [Online]. URL: https://en.wikipedia.org/wiki/Natural-language_generation.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.G.</given-names>
            <surname>Osminin</surname>
          </string-name>
          , “Modern approaches to automatic summarization,” Bulletin of South Ural State University. Series: Linguistics, no.
          <issue>25</issue>
          , pp.
          <fpage>134</fpage>
          -
          <lpage>135</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.A.</given-names>
            <surname>Rytsarev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.V.</given-names>
            <surname>Blagov</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.I.</given-names>
            <surname>Khotilin</surname>
          </string-name>
          , “
          <article-title>Development and implementation of services to collect social networking data in order to improve the human environment</article-title>
          ,”
          <source>Collected papers of ITNT (Information Technologies and Nanotechnologies)</source>
          , pp.
          <fpage>2452</fpage>
          -
          <lpage>2457</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Strobelt</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          , “
          <article-title>Catching a Unicorn with GLTR: A tool to detect automatically generated text</article-title>
          ,”
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>