=Paper=
{{Paper
|id=Vol-2667/paper27
|storemode=property
|title=Building a graph of a sequence of text units to create a sentence generation system
|pdfUrl=https://ceur-ws.org/Vol-2667/paper27.pdf
|volume=Vol-2667
|authors=Maksim Kaminskiy,Igor Rytsarev,Alexander Kupriyanov,Maximilian Khotilin
}}
==Building a graph of a sequence of text units to create a sentence generation system ==
Maksim Kaminskiy
Samara National Research University
Samara, Russia
beefiestracer@gmail.com

Igor Rytsarev
Samara National Research University; Image Processing Systems Institute of RAS - Branch of the FSRC "Crystallography and Photonics" RAS
Samara, Russia
rycarev@gmail.com

Alexander Kupriyanov
Samara National Research University; Image Processing Systems Institute of RAS - Branch of the FSRC "Crystallography and Photonics" RAS
Samara, Russia
alexkupr@gmail.com

Maximilian Khotilin
Samara National Research University
Samara, Russia
turbomax.1994@gmail.com
Abstract—The article is devoted to the development of a text data analysis system. The approaches considered are the presentation of text from the posts of a single page in the form of a dictionary of phrases for sentence generation, and the application of the developed system for correcting the results of neural network generation. Within the framework of the work, data collection, filtering and processing using Big Data technologies were implemented.

Keywords—annotation, social networks, big data, graph, machine learning

I. INTRODUCTION

The 'social network' notion was used by sociologists back in the 1920s for investigating the interrelations between participants of different communities. The psychologist Jacob Moreno offered sociograms: graphs on which separate individuals were represented by points, and interrelations between them by lines. The idea of using the apparatus of graph theory for studying interrelations between people was taken up by specialists in such areas as sociology, psychology, anthropology, political science and economics; thus the field of Social Network Analysis was established, dealing with the structural properties of social interrelations modeled in the form of graphs and networks. Building the model based on various data from printed media, additional inquiries and questionnaires was an important but rather time-consuming stage of such investigations [1].

Contemporary social networks have made the life of researchers substantially easier, presenting them with a growing and easily accessible source of big data. Every day the users of social networks generate large volumes of data of different types. The results of analyzing this information may become perfect material for investigations in various fields [2]. For example, Social Media Marketing (SMM) is an important tool of Internet promotion for many companies. Social networks are an environment in which all users unconsciously work as focus groups, and do not hesitate to share their opinions, argue, prove their case, and express their needs and wishes. Companies are constantly looking for client insights that people share on social networks [3]. One of the tools for these studies is content analysis: a text analysis method carried out by counting the occurrence of components in the analyzed information, used in sociology as well as in computer technology. The purpose of this method is to identify or measure various facts and trends reflected in the investigated documents. Using content analysis, it is possible to establish both the characteristics of information sources and the characteristics of the communication process. Content analysis can be used to study most documentary sources, but it works best with a relatively large amount of single-order data [4]. Hence it is vital to be able to represent these data in a form convenient for efficient analysis.

From a commercial point of view, the most successful natural-language generation (NLG) applications have been data-to-text conversion systems that generate text summaries of databases and datasets. These systems usually perform data analysis as well as text generation. Research has shown that text-based summaries can be more effective than graphics and other visual elements for decision support, and that computer-generated texts can outperform (from the reader's point of view) human-written texts. There is currently considerable commercial interest in using NLG to summarize financial and business data. Gartner has said that NLG will become a standard feature of 90% of modern BI (business intelligence) and analytics platforms. NLG is also used commercially for automated journalism, chatbots, creating product descriptions for e-commerce sites, and compiling brief medical records [5].

Text annotation methods can be broken down into two groups: extracting and generating. Among the extracting methods of automatic annotation, the graph-theory-based method can be distinguished, where the text is presented as a graph whose nodes are text fragments and whose edges are relations among them [6].

II. TASK SETTINGS

The modern world is dynamic and computerized, and the employee is required to complete a task fast and with the greatest possible quality. Software that uses the developed algorithm can be used by employees in occupations where it is necessary to type text for drawing up similar-in-content documents, decreasing the time spent on such a task, or in organizations serving citizens with disabilities (static and dynamic disorders of the upper limbs, visual impairments) in quota-based positions whose duties are directly related to work with computers. Also, the software tool can be used in the field of education, providing students with the opportunity to save time on reporting on completed work. With its help, it is possible to
Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
Data Science
facilitate blogging on social networks for professional, entertaining or educational purposes, as the algorithm will learn the style of the written texts and begin to suggest the most suitable words for input.

III. COLLECTION AND WORKING WITH DATA

The algorithm developed in the framework of the research first collects data, then filters it in order to obtain the crucial text information, and then builds a graph of key words along which chains of words are built. Further on, if required, the system can be additionally improved by adding new texts belonging to other authors, for style combining [7].

LiveJournal, one of the best-known weblog platforms, was chosen as the source of data; it offers the possibility of publishing one's own records and commenting on those of others. This large resource abounds with weblogs on various topics, being an excellent source of large volumes of text information. All obtained information is stored in a text file for further work.

The data must be prepared for further work with the text. Hyperlinks, emojis, punctuation marks and special characters are filtered out, and all remaining letters are converted to lowercase. Words shorter than four characters are filtered out as well, in order to exclude the majority of auxiliary parts of speech. After that the text is split into separate key words. Lemmatization of tokens, i.e. reducing words to their initial form, is then performed. Under lemmatization the parts of speech are transformed as follows: nouns – singular, nominative case; adjectives – singular, masculine, nominative case; verbs – indefinite form (infinitive). An example of lemmatization can be seen in Figure 1.

Fig. 1. Example of transformation of words into lemmas.

The vocabulary, arranged by the frequency of key words, is created after these transformations; based on it, the phrase matrix is built, whose terms are the numbers of repetitions of word pairs in the text. Further on, having the phrase matrix and the vocabulary W = w_1 w_2 w_3 … w_n, a graph can be built. The nodes of the graph are the key words w_i from the vocabulary W, and the edges connect them into phrases from the text. The number of repetitions of the word pair is given as the edge weight.

When improving the system, a new portion of processed information is introduced into the graph. However, since the weight of a new bond will at first be less than that of the bonds already existing in the graph, a new structure is introduced at every node for compensation purposes, represented as a stack of words (K = k_1 k_2 k_3 … k_m, where k_j is a word taken separately from the stack). It holds the latest bonds created after the node. Priority for output is given to new data, and the low weight of the bond is compensated by introducing a coefficient s, which depends on the position of the word in the stack. By selecting it, logic chains for two sets of data can be built at once, with some priority given to the second one, because it was used for improving the system. The summarized scheme of work of the described algorithm is given in Figure 2.

Fig. 2. Schematic representation of the work of the algorithm.

IV. COMPARING THE STYLES OF DIFFERENT AUTHORS

Two posts dedicated to the "Cats" screen musical by two different authors were taken for the research. Having obtained the text, the data were filtered and processed to make the vocabulary of key words and the matrix of phrases. Two weighted graphs were then built; they are presented in Figures 3 and 4.

The first graph has 297 nodes and 385 edges; the second one has 296 nodes and 384 edges. Comparing them, a total of 49 nodes having the same name were found. For these 49 coincidences, a big difference between neighboring nodes can be observed, leading to the conclusion that the frequency of the coinciding words differs between the authors.

Further on, we consider the total capacity of every node of the graph. As seen from Figures 5 and 6, the first author frequently uses certain words (for example, variations of the word «быть» ("be")), while the other author's word usage is more even.

As a result of these comparisons it can be concluded that the lexicon of the authors differs substantially, even though the articles are written on a similar topic.

The developed algorithm was also applied to texts generated by GRU and LSTM neural networks, in order to eliminate word errors and increase contextual connectivity. As a dataset for training the text generation neural networks, a text consisting of the speeches of the characters of Shakespeare's plays was taken. To check the generated texts, GLTR (Giant Language model Test Room) [8] was selected.
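The preprocessing and graph-construction steps of Section III can be sketched as a minimal Python fragment. This is an illustration only: the filtering rules are simplified, the sample sentence is invented, and the lemmatization step of the real pipeline (which would require a Russian morphological analyzer) is omitted.

```python
import re
from collections import Counter

def preprocess(text: str) -> list[str]:
    """Roughly the filtering of Section III: drop hyperlinks,
    punctuation, digits and special characters, lowercase everything,
    and discard words shorter than four characters."""
    text = re.sub(r"https?://\S+", " ", text)          # hyperlinks
    text = re.sub(r"[^\w\s]|\d|_", " ", text.lower())  # punctuation, digits
    return [w for w in text.split() if len(w) >= 4]

def build_graph(tokens: list[str]) -> Counter:
    """Weighted graph stored as an edge -> weight map: nodes are key
    words, an edge links consecutive words, and the weight is the
    number of repetitions of that word pair in the text."""
    return Counter(zip(tokens, tokens[1:]))

tokens = preprocess("See https://example.com cats sing, cats sing loudly!")
graph = build_graph(tokens)
print(graph[("cats", "sing")])  # the pair "cats sing" occurs twice -> 2
```

In the real system an edge list like this would be enriched with the per-node stack K and the coefficient s described above; those details are left out of the sketch.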
VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020) 122
GLTR is a tool for detecting text that was automatically generated. This instrument can take any text data and analyze what the GPT-2 language model would predict in each position. Each text is analyzed according to how likely each word is to be the predicted word, taking into account the context on the left. If the actual word used is among the top 10 predicted words, its background is colored green; for the top 100, yellow; for the top 1000, red; otherwise, purple. Figures 7 and 8 show the results of the analysis of the texts, and Figures 9 and 10 show histograms where the number of predictions for each of the texts is calculated [8].

Fig. 3. Simplified representation of the graph drawn from the "Musical "Cats" in cinema: mutants not able to sing" post, created by the user named shakko_kitsune.

Fig. 4. Simplified representation of the graph drawn from the "Cats: purring musical" post, created by the user named carabas.

Fig. 5. Capacity of nodes of the first graph.

Fig. 6. Capacity of nodes of the second graph.

V. CORRECTION OF GENERATED TEXTS

As can be seen from the generation results, the text generated by GRU turned out to be not very contextually connected, and 7 words were displayed with errors; the LSTM-generated text contains slightly more word errors (9), but according to the data from the histograms it surpassed the previous neural network.

Fig. 7. GRU text generation result.

To correct the received texts, an algorithm was developed that, in conjunction with the constructed graph, allows correcting errors in words and increasing contextual connectivity.
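The GLTR coloring scheme described above reduces to bucketing each word by the rank of the actual word in the model's predicted distribution. A toy re-implementation is sketched below; the ranks are invented stand-ins, since in the real tool they come from GPT-2.

```python
def gltr_bucket(rank: int) -> str:
    """Map the rank of the actual next word in the model's prediction
    to GLTR's four color buckets (top 10 / 100 / 1000 / other)."""
    if rank <= 10:
        return "green"
    if rank <= 100:
        return "yellow"
    if rank <= 1000:
        return "red"
    return "purple"

def bucket_histogram(ranks: list[int]) -> dict[str, int]:
    """Count words per bucket, as in the histograms of Figures 9 and 10."""
    hist = {"green": 0, "yellow": 0, "red": 0, "purple": 0}
    for r in ranks:
        hist[gltr_bucket(r)] += 1
    return hist

# In GLTR the ranks are produced by GPT-2; these values are made up.
print(bucket_histogram([1, 3, 42, 250, 7, 5000]))
```

A mostly green/yellow histogram suggests machine-generated text, since language models tend to emit high-probability words; human text spreads further into the red and purple buckets.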
Fig. 8. LSTM text generation result.

Fig. 9. Prediction histogram for GRU-generated text.

Fig. 10. Prediction histogram for LSTM-generated text.

To build the graph, the text on which the neural networks were trained was used (Figure 11).

Fig. 11. Simplified image of a graph composed from speeches of characters from Shakespeare's plays.

Then triples of words h_i, h_{i+1} and h_{i+2} are examined. Since we are working with a large text data set, "windows" of the three words considered during the operation of the algorithm are enough to correct the text. Each word h_i in the sentence is checked for its presence in the graph, and then the words associated with it are considered; if the word is not in the graph, we shift our "window" by 1 step. Then we check for the presence of the related word h_{i+1}. If h_{i+1} is in the list, we shift the "window" by 1 step and continue checking. If not, we look at the word h_{i+2}. We check the presence of this word in the graph; if it is absent, we shift the "window" by 3 steps. If the word is found, we check through which words, linked from h_i, it is possible to establish a connection with h_{i+2}. If there are connecting words, we put h_{i+1} in place of one of them; if not, we shift the "window" by 2 steps. The choice of the word to put in place of h_{i+1} is carried out by calculating the probability by the formula:

p_i = g_i / Σ_{j=1}^{N} g_j ,  (1)

where g_i is the weight of the edge between h_{i+1} and the word connecting it to h_i, p_i is the probability of choice, and N is the number of connected words. A generalized scheme of the described algorithm is presented in Figure 12.

Fig. 12. Schematic representation of the work of the algorithm.

After processing the text generated by GRU and LSTM, almost all incorrectly composed words were replaced and the contextual connection between words in the sentences improved slightly. Figures 13, 14, 15 and 16 show the results after adjustment.

As a result of applying the algorithm, 6 out of 7 incorrectly composed words were eliminated in the text generated by GRU, and 7 out of 9 in the text from LSTM; the GLTR analysis results were also improved.

VI. CONCLUSION

We presented a learning annotation system based on graph theory that allows building chains of words similar in style to the texts of the authors and that can, if required, be additionally improved by loading texts of another authorship or another topic. In the further development of the elaborated system, we consider its usage in typing sizable texts to be expedient, which allows increasing the rate of their writing.
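The replacement choice governed by formula (1) in Section V can be sketched as follows. The graph fragment, its weights and the word names are all invented for illustration; one reading of (1) is assumed here, namely that g is the weight of the edge leading from the candidate word to h_{i+2}, and the full window-shifting logic over whole sentences is omitted.

```python
import random

# Toy adjacency map: graph[a][b] = weight of the edge a -> b.
# In the paper the graph is built from the Shakespeare training text.
graph = {
    "gentle": {"lord": 4, "lady": 2},
    "lord":   {"speak": 3},
    "lady":   {"speak": 1},
}

def candidate_replacements(h_i: str, h_i2: str) -> dict[str, float]:
    """Words w on a path h_i -> w -> h_i2, weighted per formula (1):
    p = g / sum(g_j), with g the weight of the edge from w to h_i2."""
    weights = {w: graph[w][h_i2]
               for w in graph.get(h_i, {})
               if h_i2 in graph.get(w, {})}
    total = sum(weights.values())
    return {w: g / total for w, g in weights.items()} if total else {}

def correct_middle_word(h_i: str, h_i1: str, h_i2: str) -> str:
    """If h_i1 is unknown as a neighbor of h_i, sample a replacement
    that restores the h_i -> ? -> h_i2 chain; otherwise keep the word."""
    if h_i1 in graph.get(h_i, {}):
        return h_i1
    probs = candidate_replacements(h_i, h_i2)
    if not probs:
        return h_i1  # no connecting word: the window would shift instead
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(candidate_replacements("gentle", "speak"))  # {'lord': 0.75, 'lady': 0.25}
```

With these made-up weights, a broken middle word between "gentle" and "speak" would be replaced by "lord" with probability 3/4 and by "lady" with probability 1/4.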
Fig. 13. Prediction histogram for corrected text generated by GRU.

Fig. 14. Prediction histogram for corrected text generated by LSTM.

Fig. 15. The result of applying the corrective algorithm to the text generated by GRU.

Fig. 16. The result of applying the corrective algorithm to the text generated by LSTM.

An algorithm was also developed that, when used in conjunction with the graph of keywords, can be applied to texts generated by neural networks to correct incorrectly generated words and increase contextual connectivity. In addition, this algorithm improved the results of the GLTR analysis of the text.

This application has an extensive scope. For example, in the current situation, given the unfavorable epidemic situation caused by the coronavirus infection, the majority of the population needs to master remote work. At the same time, the specificity of each sphere implies certain terminology and a set of the most used turns of speech when creating product descriptions or corresponding with customers and partners. This program will simplify and accelerate the work on a body of textual material, which in turn will increase the productivity of the labor process and help to deal more effectively with deadlines (which, by the way, have already become the norm of modern life).

Also, this program can provide assistance in the preparation of advertising articles and political campaign materials, allowing one to analyze large textual volumes (for example, articles on the Internet or in print media) to determine the intentions and psychological state of target groups, and to identify attitudes, interests, values and belief systems by highlighting the most commonly used expressions and turns of phrase. Subsequently, relying on these stable constructions and using them in composing his own texts, the author acts between the lines on the reader's unconscious mind, letting him know that they speak the same language and that their problems and ideals are the same, thereby increasing the level of openness to the presented information and trust in it.

But there is also a category of people who find it difficult to type texts on a computer keyboard due to limited health capabilities: for example, a person with spastic disorders of the upper extremities who works on a PC. Each movement is much more difficult for him and requires greater effort than for a conditionally healthy person, and besides, his exhaustion and fatigue come much faster. Here the use of the application will act as a significant assistant, allowing one to minimize arbitrary movements and, therefore, the energy expended.

Thus, the first algorithm presented:
- simplifies typing, as it learns the style of the author's writing and suggests the most appropriate words for subsequent input;
- saves time, because instead of manual typing one can use the word options displayed by the algorithm, which partially automates the process of working with text;
- increases productivity, reducing time costs and making it possible to do more work in the same amount of time.

As a result, the totality of these advantages allows one to increase productivity.
And the second algorithm:
- fixes errors in incorrectly generated words by replacing them;
- increases the contextual coherence of the text by replacing words with those that have associations within the graph and that occur in the original, human-written text.

This transformation brings the text closer to the style of the author, making it look less like text compiled by a machine.

ACKNOWLEDGMENT

The work was done with financial support from the Russian Foundation for Basic Research (No. 18-37-00418, No. 19-29-01135, No. 19-31-90160) and the Ministry of Science and Higher Education of the Russian Federation (grant # 0777-2020-0017) in the framework of fulfilling the governmental task of Samara University and the FSRC "Crystallography and Photonics" of RAS.

REFERENCES

[1] W. Tan, B. Blake and L. Saleh, "Social-Network-Sourced Big Data Analytics," Open Systems. DBMS, no. 8, pp. 37-41, 2013.
[2] I.A. Rytsarev, D.V. Kirish and A.V. Kupriyanov, "Clustering of media content from social networks using BigData technology," Computer Optics, vol. 42, no. 5, pp. 921-927, 2018. DOI: 10.18287/2412-6179-2018-42-5-921-927.
[3] "Social network analytics: 10 ways to use monitoring systems," YouScan - Social Media Monitoring System, 2019 [Online]. URL: https://youscan.io/ru/blog/10-instrumentov-analiza-socsetei/.
[4] I.V. Dmitriev, "Content analysis: essence, tasks, procedures," PSI-FACTOR - Center for Practical Psychology, 2005 [Online]. URL: https://psyfactor.org/lib/k-a.htm.
[5] "Natural-language generation," Wikipedia [Online]. URL: https://en.wikipedia.org/wiki/Natural-language_generation.
[6] P.G. Osminin, "Modern approaches to automatic summarization," Bulletin of South Ural State University. Series: Linguistics, no. 25, pp. 134-135, 2012.
[7] I.A. Rytsarev, A.V. Blagov and M.I. Khotilin, "Development and implementation of services to collect social networking data in order to improve the human environment," Collected papers of ITNT. Information technologies and nanotechnologies, pp. 2452-2457, 2018.
[8] H. Strobelt and S. Gehrmann, "Catching a Unicorn with GLTR: A tool to detect automatically generated text," Catching Unicorns with GLTR, 2019.