=Paper=
{{Paper
|id=Vol-2667/paper27
|storemode=property
|title=Building a graph of a sequence of text units to create a sentence generation system
|pdfUrl=https://ceur-ws.org/Vol-2667/paper27.pdf
|volume=Vol-2667
|authors=Maksim Kaminskiy,Igor Rytsarev,Alexander Kupriyanov,Maximilian Khotilin
}}
==Building a graph of a sequence of text units to create a sentence generation system ==
Maksim Kaminskiy
Samara National Research University
Samara, Russia
beefiestracer@gmail.com

Igor Rytsarev
Samara National Research University; Image Processing Systems Institute of RAS - Branch of the FSRC "Crystallography and Photonics" RAS
Samara, Russia
rycarev@gmail.com

Alexander Kupriyanov
Samara National Research University; Image Processing Systems Institute of RAS - Branch of the FSRC "Crystallography and Photonics" RAS
Samara, Russia
alexkupr@gmail.com

Maximilian Khotilin
Samara National Research University
Samara, Russia
turbomax.1994@gmail.com
Abstract—The article is devoted to the development of a text data analysis system. The approaches considered are the presentation of text from the posts of a single page in the form of a dictionary of phrases for sentence generation, and the application of the developed system for correcting the results of neural network generation. Within the framework of the work, data collection, filtering and processing using Big Data technologies were implemented.

Keywords—annotation, social networks, big data, graph, machine learning

I. INTRODUCTION

The 'social network' notion was used by sociologists back in the 1920s for investigating the interrelations between participants of different communities. The psychologist Jacob Moreno offered sociograms: graphs on which separate individuals were represented by points, and interrelations between them by lines. The idea of using the apparatus of graph theory for studying interrelations between people was taken up by specialists in such areas as sociology, psychology, anthropology, political science and economics; thus the field of Social Network Analysis was established, dealing with the structural properties of social interrelations modeled in the form of graphs and networks. Building the model based on various data from printed media, additional inquiries and questionnaires was an important but rather time-consuming stage of such investigations [1].

Contemporary social networks have made the life of researchers substantially easier, presenting them with a growing and easily accessible source of big data. Every day the users of social networks generate large volumes of data of different types. The results of analyzing this information may become perfect material for investigations in various fields [2]. For example, Social Media Marketing (SMM) is an important tool of Internet promotion for many companies. Social networks are an environment in which all users unconsciously work as focus groups, and do not hesitate to share their opinions, argue, prove their case, and express their needs and wishes. Companies are constantly looking for client insights that people share on social networks [3]. One of the tools for these studies is content analysis: a text analysis method carried out by counting the occurrence of components in the analyzed information, used in sociology as well as in computer technology. The purpose of this method is to identify or measure various facts and trends reflected in the investigated documents. Using content analysis, it is possible to establish both the characteristics of information sources and the characteristics of the communication process. Content analysis can be used to study most documentary sources, but it works best with a relatively large amount of single-order data [4]. Hence it is vital to be able to represent these data in a form convenient for efficient analysis.

From a commercial point of view, the most successful natural-language generation (NLG) applications have been data-to-text conversion systems that generate text summaries of databases and datasets. These systems usually perform data analysis as well as text generation. Research has shown that text-based summaries can be more effective than graphics and other visual elements for decision support, and that computer-generated texts can outperform (from the reader's point of view) human-written texts. There is currently considerable commercial interest in using NLG to summarize financial and business data. Gartner has said that NLG will become a standard feature of 90% of modern BI (business intelligence) and analytics platforms. NLG is also used commercially for automated journalism, chatbots, creating product descriptions for e-commerce sites, and compiling brief medical records [5].

Text annotation methods can be broken down into two groups: extracting and generating. Among the extracting methods of automatic annotation, the graph-theory-based method can be distinguished, where the text is presented as a graph whose nodes are text fragments and whose edges are relations among them [6].

II. TASK SETTINGS

The modern world is dynamic and computerized, and the employee is required to complete a task fast and with the greatest possible quality. Software that uses the developed algorithm can be used by employees in occupations where it is necessary to type text for drawing up similar-in-content documents, decreasing the time spent on such a task, or in organizations serving citizens with disabilities (static and dynamic disorders of the upper limbs, visual impairments) in quota-based positions whose duties are directly related to work with computers. Also, the software tool can be used in the field of education, providing students with the opportunity to save time on reporting on completed work. With its help, it is possible to
Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
Data Science
facilitate blogging on social networks for professional, entertaining or educational purposes, as the algorithm will learn the style of the written texts and begin to suggest the most suitable words for input.

III. COLLECTION AND WORKING WITH DATA

The algorithm developed in the framework of the research first collects data, then filters it in order to obtain the crucial text information, and then builds a graph of key words along which chains of words are built. Further on, if required, the system can be additionally improved by adding new texts belonging to other authors, for style combining [7].

LiveJournal, one of the best-known weblog platforms, was chosen as the source of data; it offers the possibility of publishing one's own records and commenting on those of others. This large resource abounds with weblogs on various topics, being an excellent source of large volumes of text information. All obtained information is stored in a text file for further work.

The data must be prepared for further work with the text. Hyperlinks, emojis, punctuation marks and special characters are filtered out, and all remaining letters are converted to lowercase. Words shorter than four characters are filtered out as well, in order to exclude the majority of auxiliary parts of speech. After that the text is split into separate key words. Lemmatization of tokens, i.e. reducing words to their initial form, is then performed. Under lemmatization the parts of speech are transformed as follows: nouns – singular, nominative case; adjectives – singular, masculine, nominative case; verbs – indefinite form (infinitive). An example of lemmatization can be seen in Figure 1.

Fig. 1. Example of transformation of words into lemmas.

The vocabulary, arranged by the frequency of key words, is created after these transformations; based on it, the phrase matrix is built, whose terms are the numbers of repetitions of word pairs in the text. Further on, having the phrase matrix and the vocabulary W = w_1 w_2 w_3 … w_n, a graph can be built. The nodes of the graph are the key words w_i from the vocabulary W, and the edges connect them into phrases from the text. The number of repetitions of the word pair is given as the edge weight.

When improving the system, a new portion of processed information is introduced into the graph. However, since the weight of a new bond will at first be less than that of the bonds already existing in the graph, a new structure is introduced at every node for compensation purposes, represented as a stack of words (K = k_1 k_2 k_3 … k_m, where k_j is a word taken separately from the stack). It holds the latest bonds created after the node. Priority for output is given to new data, and the low weight of the bond is compensated by introducing a coefficient s, which depends on the position of the word in the stack. By selecting it, logic chains for two sets of data can be built at once, with some priority given to the second one, because it was used for improving the system. The summarized scheme of work of the described algorithm is given in Figure 2.

Fig. 2. Schematic representation of the work of the algorithm.

IV. COMPARING THE STYLES OF DIFFERENT AUTHORS

Two posts dedicated to the "Cats" screen musical by two different authors were taken for the research. Having obtained the text, the data were filtered and processed to make the vocabulary of key words and the matrix of phrases. Two weighted graphs were then built; they are presented in Figures 3 and 4.

The first graph has 297 nodes and 385 edges; the second one has 296 nodes and 384 edges. Comparing them, a total of 49 nodes having the same name were found. For these 49 coincidences, a big difference between neighboring nodes can be observed, leading to the conclusion that the frequency of the coinciding words differs between the authors.

Further on, we consider the total capacity of every node of the graph. As seen from Figures 5 and 6, the first author frequently uses certain words (for example, variations of the word «быть» ("be")), while the other author's word usage is more even.

As a result of these comparisons it can be concluded that the lexicon of the authors differs substantially, even though the articles are written on a similar topic.

The developed algorithm was also applied to texts generated by GRU and LSTM neural networks, in order to eliminate word errors and increase contextual connectivity. As a dataset for training the text generation neural networks, a text consisting of the speeches of the characters of Shakespeare's plays was taken. To check the generated texts, GLTR (Giant Language model Test Room) [8] was selected.
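The preprocessing and graph-construction steps of Section III can be sketched as a minimal Python fragment. This is an illustration only: the filtering rules are simplified, the sample sentence is invented, and the lemmatization step of the real pipeline (which would require a Russian morphological analyzer) is omitted.

```python
import re
from collections import Counter

def preprocess(text: str) -> list[str]:
    """Roughly the filtering of Section III: drop hyperlinks,
    punctuation, digits and special characters, lowercase everything,
    and discard words shorter than four characters."""
    text = re.sub(r"https?://\S+", " ", text)          # hyperlinks
    text = re.sub(r"[^\w\s]|\d|_", " ", text.lower())  # punctuation, digits
    return [w for w in text.split() if len(w) >= 4]

def build_graph(tokens: list[str]) -> Counter:
    """Weighted graph stored as an edge -> weight map: nodes are key
    words, an edge links consecutive words, and the weight is the
    number of repetitions of that word pair in the text."""
    return Counter(zip(tokens, tokens[1:]))

tokens = preprocess("See https://example.com cats sing, cats sing loudly!")
graph = build_graph(tokens)
print(graph[("cats", "sing")])  # the pair "cats sing" occurs twice -> 2
```

In the real system an edge list like this would be enriched with the per-node stack K and the coefficient s described above; those details are left out of the sketch.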
VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020) 122
GLTR is a tool for detecting text that was automatically generated. This instrument can take any text data and analyze what the GPT-2 language model would predict in each position. Each text is analyzed according to how likely each word is to be the predicted word, taking into account the context on the left. If the actual word used is among the top 10 predicted words, its background is colored green; for the top 100, yellow; for the top 1000, red; otherwise, purple. Figures 7 and 8 show the results of the analysis of the texts, and Figures 9 and 10 show histograms where the number of predictions for each of the texts is calculated [8].

Fig. 3. Simplified representation of the graph drawn from the "Musical "Cats" in cinema: mutants not able to sing" post, created by the user named shakko_kitsune.

Fig. 4. Simplified representation of the graph drawn from the "Cats: purring musical" post, created by the user named carabas.

Fig. 5. Capacity of nodes of the first graph.

Fig. 6. Capacity of nodes of the second graph.

V. CORRECTION OF GENERATED TEXTS

As can be seen from the generation results, the text generated by GRU turned out to be not very contextually connected, and 7 words were displayed with errors; the LSTM-generated text contains slightly more word errors (9), but according to the data from the histograms it surpassed the previous neural network.

Fig. 7. GRU text generation result.

To correct the received texts, an algorithm was developed that, in conjunction with the constructed graph, allows correcting errors in words and increasing contextual connectivity.
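The GLTR coloring scheme described above reduces to bucketing each word by the rank of the actual word in the model's predicted distribution. A toy re-implementation is sketched below; the ranks are invented stand-ins, since in the real tool they come from GPT-2.

```python
def gltr_bucket(rank: int) -> str:
    """Map the rank of the actual next word in the model's prediction
    to GLTR's four color buckets (top 10 / 100 / 1000 / other)."""
    if rank <= 10:
        return "green"
    if rank <= 100:
        return "yellow"
    if rank <= 1000:
        return "red"
    return "purple"

def bucket_histogram(ranks: list[int]) -> dict[str, int]:
    """Count words per bucket, as in the histograms of Figures 9 and 10."""
    hist = {"green": 0, "yellow": 0, "red": 0, "purple": 0}
    for r in ranks:
        hist[gltr_bucket(r)] += 1
    return hist

# In GLTR the ranks are produced by GPT-2; these values are made up.
print(bucket_histogram([1, 3, 42, 250, 7, 5000]))
```

A mostly green/yellow histogram suggests machine-generated text, since language models tend to emit high-probability words; human text spreads further into the red and purple buckets.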
Fig. 8. LSTM text generation result.

Fig. 9. Prediction histogram for GRU-generated text.

Fig. 10. Prediction histogram for LSTM-generated text.

To build the graph, the text on which the neural networks were trained was used (Figure 11).

Fig. 11. Simplified image of a graph composed from speeches of characters from Shakespeare's plays.

Then triples of words h_i, h_{i+1} and h_{i+2} are examined. Since we are working with a large text data set, "windows" of the three words considered during the operation of the algorithm are enough to correct the text. Each word h_i in the sentence is checked for its presence in the graph, and then the words associated with it are considered; if the word is not in the graph, we shift our "window" by 1 step. Then we check for the presence of the related word h_{i+1}. If h_{i+1} is in the list, we shift the "window" by 1 step and continue checking. If not, we look at the word h_{i+2}. We check the presence of this word in the graph; if it is absent, we shift the "window" by 3 steps. If the word is found, we check through which words, linked from h_i, it is possible to establish a connection with h_{i+2}. If there are connecting words, we put h_{i+1} in place of one of them; if not, we shift the "window" by 2 steps. The choice of the word to put in place of h_{i+1} is carried out by calculating the probability by the formula:

p_i = g_i / Σ_{j=1}^{N} g_j ,  (1)

where g_i is the weight of the edge between h_{i+1} and the word connecting it to h_i, p_i is the probability of choice, and N is the number of connected words. A generalized scheme of the described algorithm is presented in Figure 12.

Fig. 12. Schematic representation of the work of the algorithm.

After processing the text generated by GRU and LSTM, almost all incorrectly composed words were replaced and the contextual connection between words in the sentences improved slightly. Figures 13, 14, 15 and 16 show the results after adjustment.

As a result of applying the algorithm, 6 out of 7 incorrectly composed words were eliminated in the text generated by GRU, and 7 out of 9 in the text from LSTM; the GLTR analysis results were also improved.

VI. CONCLUSION

We presented a learning annotation system based on graph theory that allows building chains of words similar in style to the texts of the authors and that can, if required, be additionally improved by loading texts of another authorship or another topic. In the further development of the elaborated system, we consider its usage in typing sizable texts to be expedient, which allows increasing the rate of their writing.
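The replacement choice governed by formula (1) in Section V can be sketched as follows. The graph fragment, its weights and the word names are all invented for illustration; one reading of (1) is assumed here, namely that g is the weight of the edge leading from the candidate word to h_{i+2}, and the full window-shifting logic over whole sentences is omitted.

```python
import random

# Toy adjacency map: graph[a][b] = weight of the edge a -> b.
# In the paper the graph is built from the Shakespeare training text.
graph = {
    "gentle": {"lord": 4, "lady": 2},
    "lord":   {"speak": 3},
    "lady":   {"speak": 1},
}

def candidate_replacements(h_i: str, h_i2: str) -> dict[str, float]:
    """Words w on a path h_i -> w -> h_i2, weighted per formula (1):
    p = g / sum(g_j), with g the weight of the edge from w to h_i2."""
    weights = {w: graph[w][h_i2]
               for w in graph.get(h_i, {})
               if h_i2 in graph.get(w, {})}
    total = sum(weights.values())
    return {w: g / total for w, g in weights.items()} if total else {}

def correct_middle_word(h_i: str, h_i1: str, h_i2: str) -> str:
    """If h_i1 is unknown as a neighbor of h_i, sample a replacement
    that restores the h_i -> ? -> h_i2 chain; otherwise keep the word."""
    if h_i1 in graph.get(h_i, {}):
        return h_i1
    probs = candidate_replacements(h_i, h_i2)
    if not probs:
        return h_i1  # no connecting word: the window would shift instead
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(candidate_replacements("gentle", "speak"))  # {'lord': 0.75, 'lady': 0.25}
```

With these made-up weights, a broken middle word between "gentle" and "speak" would be replaced by "lord" with probability 3/4 and by "lady" with probability 1/4.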
Fig. 13. Prediction histogram for corrected text generated by GRU.

Fig. 14. Prediction histogram for corrected text generated by LSTM.

Fig. 15. The result of applying the corrective algorithm to the text generated by GRU.

Fig. 16. The result of applying the corrective algorithm to the text generated by LSTM.

An algorithm was also developed that, when used in conjunction with the graph of keywords, can be applied to texts generated by neural networks to correct incorrectly generated words and increase contextual connectivity. In addition, this algorithm improved the results of the GLTR analysis of the text.

This application has an extensive scope. For example, in the current situation, given the unfavorable epidemic situation caused by the coronavirus infection, the majority of the population needs to master remote work. At the same time, the specificity of each sphere implies certain terminology and a set of the most used turns of speech when creating product descriptions or corresponding with customers and partners. This program will simplify and accelerate the work on a body of textual material, which in turn will increase the productivity of the labor process and help to deal more effectively with deadlines (which, by the way, have already become the norm of modern life).

Also, this program can provide assistance in the preparation of advertising articles and political campaign materials, allowing one to analyze large textual volumes (for example, articles on the Internet or in print media) to determine the intentions and psychological state of target groups, and to identify attitudes, interests, values and belief systems by highlighting the most commonly used expressions and turns of phrase. Subsequently, relying on these stable constructions and using them in composing his own texts, the author acts between the lines on the reader's unconscious mind, letting him know that they speak the same language and that their problems and ideals are the same, thereby increasing the level of openness to the presented information and trust in it.

But there is also a category of people who find it difficult to type texts on a computer keyboard due to limited health capabilities: for example, a person with spastic disorders of the upper extremities who works on a PC. Each movement is much more difficult for him and requires greater effort than for a conditionally healthy person, and besides, his exhaustion and fatigue come much faster. Here the use of the application will act as a significant assistant, allowing one to minimize arbitrary movements and, therefore, the energy expended.

Thus, the first algorithm presented:
- simplifies typing, as it learns the style of the author's writing and suggests the most appropriate words for subsequent input;
- saves time, because instead of manual typing one can use the word options displayed by the algorithm, which partially automates the process of working with text;
- increases productivity, reducing time costs and making it possible to do more work in the same amount of time.

As a result, the totality of these advantages allows one to increase productivity.
And the second algorithm:
- fixes errors in incorrectly generated words by replacing them;
- increases the contextual coherence of the text by replacing words with those that have associations within the graph and that occur in the original, human-written text.

This transformation brings the text closer to the style of the author, making it look less like text compiled by a machine.

ACKNOWLEDGMENT

The work was done with financial support from the Russian Foundation for Basic Research (No. 18-37-00418, No. 19-29-01135, No. 19-31-90160) and the Ministry of Science and Higher Education of the Russian Federation (grant # 0777-2020-0017) in the framework of fulfilling the governmental task of Samara University and the FSRC "Crystallography and Photonics" of RAS.

REFERENCES

[1] W. Tan, B. Blake and L. Saleh, "Social-Network-Sourced Big Data Analytics," Open Systems. DBMS, no. 8, pp. 37-41, 2013.
[2] I.A. Rytsarev, D.V. Kirish and A.V. Kupriyanov, "Clustering of media content from social networks using BigData technology," Computer Optics, vol. 42, no. 5, pp. 921-927, 2018. DOI: 10.18287/2412-6179-2018-42-5-921-927.
[3] "Social network analytics: 10 ways to use monitoring systems," YouScan - Social Media Monitoring System, 2019 [Online]. URL: https://youscan.io/ru/blog/10-instrumentov-analiza-socsetei/.
[4] I.V. Dmitriev, "Content analysis: essence, tasks, procedures," PSI-FACTOR - Center for Practical Psychology, 2005 [Online]. URL: https://psyfactor.org/lib/k-a.htm.
[5] "Natural-language generation," Wikipedia [Online]. URL: https://en.wikipedia.org/wiki/Natural-language_generation.
[6] P.G. Osminin, "Modern approaches to automatic summarization," Bulletin of South Ural State University. Series: Linguistics, no. 25, pp. 134-135, 2012.
[7] I.A. Rytsarev, A.V. Blagov and M.I. Khotilin, "Development and implementation of services to collect social networking data in order to improve the human environment," Collected papers of ITNT. Information technologies and nanotechnologies, pp. 2452-2457, 2018.
[8] H. Strobelt and S. Gehrmann, "Catching a Unicorn with GLTR: A tool to detect automatically generated text," Catching Unicorns with GLTR, 2019.