=Paper= {{Paper |id=Vol-3132/Paper_1 |storemode=property |title=Automated Bot Detection Based on Coherence Metric |pdfUrl=https://ceur-ws.org/Vol-3132/Paper_1.pdf |volume=Vol-3132 |authors=Oleksandr Marchenko,Mariam Isoieva |dblpUrl=https://dblp.org/rec/conf/iti2/MarchenkoI21 }} ==Automated Bot Detection Based on Coherence Metric== https://ceur-ws.org/Vol-3132/Paper_1.pdf
Automated Bot Detection Based on Coherence Metric
Oleksandr Marchenko and Mariam Isoieva
Taras Shevchenko National University of Kyiv, 60, Volodymyrska Str., Kyiv, 01033, Ukraine

                 Abstract
                 This paper describes a model of bot detection based on a coherence metric. Much attention
                 has been paid to natural language generation technologies since the middle of the previous
                 century, and many companies invest in research in this direction. The quality of automatically
                 generated texts keeps improving, and it is now hard to distinguish machine-generated texts
                 from those written by humans. While this can help people automate even creative tasks,
                 the wide availability of such technologies also has a downside, e.g. easier spread of
                 fake news and propaganda. That is why it is important to develop efficient bot detection
                 methods as tools for protection against misinformation. This paper is an attempt to
                 create such a model. One of the main distinguishing features of high-quality texts
                 is coherence, which automatically generated texts, especially long ones, often lack.
                 A classifier with a set of features based on a coherence metric and syntactic characteristics
                 has been built. The method can be extended and used for different languages.

                 Keywords
                 Bot detection, natural language generation, coherence

1. Introduction
    The human mind is constantly evolving, and new intellectual needs arise.
Automatically generated texts can be used for good purposes: for education, entertainment, for easier
and faster knowledge gathering and summarization. But there is also a dark side to these technologies.
They are used for sharing propaganda and fake news, conducting illegal political campaigns,
committing financial crime through automatic generation of credit applications, and misleading people.
Such texts are being spread faster and faster, and their posting in social networks
and web resources can be automated. That is why bot detection technologies are needed. While much
attention is paid to the development of natural language generation systems, the detection of
machine-generated texts is out of the spotlight. Are there more potential dangers than everyone sees?
    With the rapid development of natural language generation technologies, it is hard to distinguish
automatically created texts from those written by humans. While many existing methods for bot
detection rely on features related to supporting information, such as settings of the analyzed accounts
in social networks or their activity, less attention is paid to such analysis based on pure text
characteristics. The main goal of this work is to develop a model for the identification of
automatically generated text based on its coherence.
    Coherence is one of the main characteristics of high-quality text, which is easy to perceive and
understand. This concept is multi-faceted. Coherence depends on both semantic and syntactic
features, and automatically generated texts, especially long ones, often lack it. There can be logical
breaks or topic switches inside the text, which make the text inconsistent. That is why it is reasonable
to use a coherence metric, e.g. [1], as a distinguishing feature for bot detection.

2. Models of natural language generation
    Natural language generation (NLG) is considered one of the most important milestones for
artificial intelligence. Solving the task of automatic text creation requires using powerful algorithms.

Information Technology and Implementation (IT&I-2021), December 01–03, 2021, Kyiv, Ukraine
EMAIL: omarchenko@univ.kiev.ua (A. 1); isoyevamaryam@gmail.com (A. 2)
ORCID: 0000-0002-5408-5279 (A. 1)
            ©️ 2022 Copyright for this paper by its authors.
            Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
            CEUR Workshop Proceedings (CEUR-WS.org)



It is also valuable to understand how text is being generated by the models used today to build an
efficient bot detection system. Some of the modern methods of NLG and their classification are
described in this section.
    Different criteria for the classification of NLG tasks and methods related to the generation of
natural language text can be defined.
    Some systems are used to generate texts only on specific topics, and others can be used to generate
texts on arbitrary topics. For generating texts related to a predetermined topic, it is possible to expand
the dataset, to train the model on thematically related data, and to use tools such as specialized
dictionaries and knowledge bases. The construction of “universal” text generators requires finding ways
to obtain new data at the user’s request or the availability of significant amounts of data that can be
used for generation. Thematic diversity is necessary for systems of general dialogue, summarization,
and question answering.
    In addition, universal systems must preserve the consistency of style for the specific text generated
and for all texts produced by the system. This aspect is crucial for algorithms based on the extraction
of sentences from different corpora. In addition to syntactic integrity and semantic unity, the stylistic
homogeneity of the text is also an important indicator of quality.
    By the methods of text creation, models can also be classified into extractive and abstractive.
Considering the example of automatic summarization, compiling an abstract from sentences of the
original text is called extractive. Abstractive summarization means creating a new text without
directly using original sentences as parts of the summary formed.
    For natural language generation systems, some input data is usually required as a basis for the
future text. Such data may be structured, e. g. numbers, maps of fields and corresponding values,
graphs, or unstructured, such as human-written text. This can be a corpus of texts, on which the model
is trained, or which is used to find and convert the necessary fragments (for example, when sentences
from the corpus are used). For question answering systems, the text corresponding to the request is
generated based on the questions formulated by a person. In addition, the input of the natural language
text generation system can contain numerical data: characteristics of the observed phenomenon or
process, indicators for generating reports, etc. A natural language text generation system may also
have a subsystem of semantic analysis of the input text or a set of texts for creating abstract
representations of the future text content. Such formal representations can be built based on certain
data and be set as a basis for text formation.
    The length of the text generated by the system can be controlled. For example, a natural language
description of an image may consist of one sentence or phrase, and the user can specify the number of
sentences in an automatically generated summary in advance. Generated short descriptions of news
articles are usually one or several sentences long, while automatically generated articles on a particular
topic can be much longer. The length of the texts produced by a deep learning model often depends
on what data it has been trained on. The capability to generate long texts can be determined by the
complexity of the model. Many issues arise when it comes to the generation of long texts. For
example, recurrent neural networks can have self-repetitions when the same fragments of text are
repeated several times. It is also hard to achieve coherence and ease of perception for human readers.
These may require some additional mechanisms for the control of semantics of the text being
generated. There are interactive systems for generating natural language texts, for which a user can
control some stages of generation, make intermediate decisions, and “static” ones, for which human
control is possible only at the level of the initial request. Each problem has peculiarities that
must be considered to solve it efficiently. While methods of text generation depend on the tasks
they solve, there is a common basis for such algorithms.
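    The self-repetition issue mentioned above can be illustrated with a simple measure. The following
is a minimal sketch and not a method from the cited works: it computes the share of word n-grams that
occur more than once in a text. The function name repeated_ngram_ratio, the whitespace tokenization
and the default n = 3 are assumptions made for illustration only.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Share of word n-grams that occur more than once in the text."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence of an n-gram seen at least twice counts as repeated.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


# Invented toy inputs: a looping text and a text without repeated words.
looping = "the cat sat on the mat and the cat sat on the mat and the cat sat"
varied = "a quick brown fox jumps over one lazy dog near the old river bank"
print(repeated_ngram_ratio(looping))
print(repeated_ngram_ratio(varied))
```

A high ratio only signals looping output; it says nothing about coherence in general, so such a
value could at most serve as one auxiliary feature.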
    Reiter and Dale summarize the experience of computational linguists and formulate the main steps
of automatic generation of natural language text: definition of content, discourse planning, sentence
aggregation, lexicalization, generation of key concepts, linguistic implementation [2]. In the first
stage, the information that should be specified in the generated text is defined. The form of
representation and the method of its formation depend on the generation problem formulation. At this
stage, the concepts that will be described in the text and the connections between them that need to be
highlighted are defined. The second stage is “planning” of the discourse, i.e., determining the basic
structure of the future text, organizing the semantic structures identified at the first stage. During the
next phase, the sentence representations are formed and combined into larger structures. At this stage
the basis of coherence of the future text is set: the concepts, which are connected, are grouped to form
some structures. Lexicalization is the choice of those words and phrases that reflect the essence of the
underlying meaning and are correct to use. Most of the first text generation systems used prepared
phrases suitable for the related domain or topic. Lexical diversity is a sign of high-quality text.
Increasingly complex generation methods are being developed to achieve it, and researchers try to find
methods for automatic expansion of thesauri and ontologies. The next step is the definition of the
main entities, the concepts to be discussed, for example, certain proper names that need to be
mentioned in the text [2]. For cases when the basis of generation is natural language text, there are
problems of selection of the named entities and the choice of language pointers. These steps are
necessary for further coordination of the elements of the generated text, selection of syntactic
structures, etc. Linguistic implementation is one of the most important stages, as it is the process of
generating the text itself. Grammar rules of a particular language should be followed. Nowadays the
main focus is on this step, and some of the previous ones can be omitted.
    Figure 1 summarizes the mentioned stages of natural language text generation.




Figure 1: Common steps for natural language generation, based on [2]
   Methods of generating natural language text based on deep learning models are the
most popular and widely used today. But they have some limitations. Most models, such as basic
recurrent neural networks, GRU (Gated Recurrent Unit), LSTM (Long Short-Term Memory) and their
modifications, can be used to generate short texts, the length of which is one or more sentences. More
complex models, such as the GPT family, are used to generate longer texts. But the size of the trained
model is a significant barrier for using such methods in practice, as there is a need for significant
computing resources, access to which is often limited and expensive.
   In addition to that, since neural networks remain black boxes, the improvement of such models is
complex and is usually based on empirical studies of modified versions of the base model. There is
also no guarantee of the quality of the generated text, especially if it is generated based on data that is
new to the model, for example, based on texts on topics not present in the training sample.
   Attempts to explain the process of data processing and generation by neural networks are a
separate area of research of deep learning methods.

    Deep learning methods can be used to generate locally coherent short texts (one or several
sentences long), but the generation of longer sequences is still challenging. Generated text can be
inconsistent, hard to read, describe one phenomenon or fact several times or have inappropriate
semantic breaks. Recurrent neural networks are often used to generate text. Various modifications of
the basic architecture are developed to improve the generated text, in particular, to achieve a certain
level of coherence and semantic unity.
    Kiddon et al. proposed to extend the basic model of generation, in particular, to use a “list” of what
should be described in the text, to ensure the coherence of the text generated by the recurrent neural
network [3]. The developers of the pointer generator network architecture also use information about
the text that has already been generated and parts of the original text that have been analyzed [3]. This
model is used for the correct representation of factual information for automatic summarization.
    Generative adversarial network (GAN) is a popular architecture for generating images and texts. It
consists of two parts: generator and discriminator. The discriminator recognizes machine-generated
images or texts and becomes more resistant to noise in its input data, and the generator learns to
“fake” the distribution of real data.
    With sufficient amounts of training data, GANs generate impressively realistic images of people,
their faces, animals, etc. Apart from that, this architecture is used to generate texts. The discriminator
“learns” to recognize machine-generated text, and the generator trains to form texts, which are similar
to the human-written ones.
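    The interplay of the two parts is commonly written as a minimax game. The formulation below is
the standard GAN objective and is given only as an illustration; the text generation models discussed
in this section may use modified objectives. The notation is an assumption and is not taken from the
cited works: D(x) is the probability assigned by the discriminator that sample x is real, G(z) is a
sample produced by the generator from noise z, and p_data and p_z denote the data and noise
distributions.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}} \bigl[ \log D(x) \bigr]
  + \mathbb{E}_{z \sim p_{z}} \bigl[ \log \bigl( 1 - D(G(z)) \bigr) \bigr]
```

The discriminator maximizes V by assigning high probability to real samples and low probability to
generated ones, while the generator minimizes V by producing samples that the discriminator accepts
as real.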
    Generation of long texts has been studied by Sang Cho et al.; researchers propose to use GAN [4].
They emphasize the importance of local and global coherence. Namely, two discriminators have been
used: one to assess cohesion, the other to evaluate coherence. Researchers make the basic assumption
that global coherence depends on the organization of sentences and focus on assessing coherence at
the sentence level of individual paragraphs, not just adjacent sentences. Evaluation of cohesion within
sentences is based on the technique of the Deep Structured Semantic Model, which was originally
used to determine semantic similarity [4]. Previous attempts to generate text using GAN have used
language models as discriminators, which do not always adequately address aspects of text coherence.
    Most generation methods are based on the use of some raw or pre-processed data: individual
sentences of the corpus, structures that describe the plot line (for generating stories), formal graph
structures. Depending on the presentation format of such information and the method of generation,
there are different ways to reduce the amount of data searched and speed up their analysis.
    Even for heavy and powerful systems, such as GPT-2 or GPT-3, it is hard to model the long-term
dependencies for bigger texts, preserve the consistency, create the correct structure of the text. The
main indicator of such a problem is the low level of coherence, when different parts are not logically
united, connections between them are inappropriate or absent, the whole text is hard to read.
Measuring coherence is a hard task even for humans, but establishing some basis for evaluation is
required for further improvement of NLG and bot detection models.

3. Aspects of coherence
    For the solution of any problem, which requires generation of natural language text, coherence of
the text is necessary for human perception of the result. Definitions of coherence differ, but most of
them concentrate on the semantic aspect. According to the Cambridge Dictionary, “coherent” means
clear, consistent, characterized by the relatedness of its parts [5]. Linguists study the coherence of the
text as a phenomenon that finds its manifestation through the use of certain linguistic means: ellipses,
usage of synonyms, conjunctions, linguistic references and other connections. The study of coherence
is important for the development of NLG systems, as well as for bot detection models.
    The coherence of any text depends on the semantic relatedness of the main concepts, phenomena,
ideas mentioned, the syntactic consistency of the components of individual sentences and their unity
within a group of sentences.
    Depending on the level of consideration, concepts of local and global coherence of the text are
defined. Local coherence exists at the level of individual sentences and connections between parts of
neighboring sentences; it relates to semantic transitions between successive sentences. This property
is important for the creation of high-quality natural language text. Local coherence is necessary for
global coherence, which is the logical unity, the integrity of the whole text.

    Sevbo [6] has studied how the main idea is transferred and developed from sentence to sentence.
The researcher focuses on the realization of coherence through repetition of meaningful words in
consecutive sentences. The algorithm proposed in [6] is based on determining the syntactic
structure of the sentences of the input text and constructing “phrase trees” that describe it. In
addition to the repetition of words, it is necessary to take into account anaphora and coreferences.
    A method of evaluation of the coherence level, developed by Lapata and Barzilay, is based on the
model of discourse representation [7]. Researchers propose to map each text to a matrix of entities
that reflects the distribution of elements of discourse in sentences. The rows of this matrix correspond
to the sentences, and the columns represent the elements of the discourse. The corresponding elements
of the matrix indicate the presence or absence of an element of discourse in the sequence of sentences
and their syntactic role. Special notations are used for different grammatical categories (subjects,
objects, and others). If the element for the corresponding column is absent in the sentence, a special
mark ‘-’ is put [7]. Coreferences for the relevant elements are also considered. Dependency parsers
are used to determine the grammatical role of each word, in particular, the authors propose to use
parsers developed by Lin [8] and Briscoe [9]. The quality of coherence modelling depends
significantly on how accurate the coreference resolution systems and dependency parsers are.
Researchers emphasize that the discourse representation model plays an important role. This method
is based on previous studies of the coherence of the natural language texts, in which the main focus is
on the study of the influence of grammatical connections between its elements on the local coherence
of the text. The authors also make an assumption that the distribution of elements of discourse for
coherent texts follows some pattern. They use the achievements from the theory of centering, for
which the main aspect of coherence is the number and nature of semantic transitions and changes in
“focus”, i.e., the main object mentioned [7]. Based on these assumptions, it can be argued that densely
filled matrices correspond to coherent texts, while the sparseness of the matrix may indicate a weak
semantic connection between the sentences of the text under consideration. This study of coherence
focuses on the definition of typical patterns of coherent texts structure in the context of the described
model of text representation.
    The authors of the model have proposed to form numerical vector representations of texts based on
the analysis of probabilities of syntactic role changes in consecutive sentences for each entity. They
can be used for the classification of texts and their subsequent study, in particular, for evaluating the
coherence. The approach described above is also used to model coherence in text generation. In
particular, it is possible to use sentences for ranking to form the most coherent text from them or to
restore the original order of sentences. For example, McIntyre has used this technique to generate
stories [10]. In his work [10], the researcher has used a dataset of fairy tales by the writer Andrew
Lang. This choice of training data is associated with one of the tasks considered by the scientist,
namely the generation of stories and fairy tales that could be used for teaching children.
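    The matrix of entities and the transition-based vector representation described above can be
sketched as follows. This is a minimal illustration under stated assumptions, not the implementation
from [7]: the syntactic roles are given by hand instead of being produced by a dependency parser and a
coreference resolution system, the role labels S (subject), O (object), X (other) and ‘-’ (absent) and
the function names build_entity_grid and transition_features are chosen for this example, and the
three annotated sentences are invented.

```python
from collections import Counter
from itertools import product

ROLES = ("S", "O", "X", "-")  # subject, object, other, absent (assumed labels)


def build_entity_grid(sentences):
    """Rows are sentences, columns are entities, cells hold syntactic roles."""
    entities = sorted({entity for sent in sentences for entity in sent})
    grid = [[sent.get(entity, "-") for entity in entities] for sent in sentences]
    return entities, grid


def transition_features(grid, length=2):
    """Probability of each role transition of the given length over all columns."""
    counts = Counter()
    n_rows = len(grid)
    n_cols = len(grid[0]) if grid else 0
    for col in range(n_cols):
        for row in range(n_rows - length + 1):
            counts[tuple(grid[row + k][col] for k in range(length))] += 1
    total = sum(counts.values())
    return {t: (counts[t] / total if total else 0.0)
            for t in product(ROLES, repeat=length)}


# Hand-made role annotations for three consecutive sentences (hypothetical).
sentences = [
    {"bot": "S", "text": "O"},
    {"text": "S", "reader": "O"},
    {"reader": "S", "text": "X"},
]
entities, grid = build_entity_grid(sentences)
features = transition_features(grid)

for name, column in zip(entities, zip(*grid)):
    print(f"{name:8s}", " ".join(column))
print({t: round(p, 3) for t, p in features.items() if p > 0})
```

The resulting dictionary has a fixed number of entries (16 for transitions of length two over four
roles), so texts of different length map to vectors of the same size that can be passed to a
classifier or a ranking model.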
    An important step in the study of textual coherence as a phenomenon and the study of methods for
its definition has been made by Grosz, Joshi, and Weinstein [11]. They have developed the theory of
centering, which has been used in the aforementioned works. Many further developments are based
on the properties of coherence identified by these researchers, and the expansion of the constructed
theory, in particular, the model described above. The name “centering” is inspired by the basic
statement of this theory. According to Grosz et al. [11], some of the entities mentioned in the text are
central. The speaker's choice of constructions, linguistic references is determined by the central
entities, their features. Therefore, the coherence of discourse depends on the relatedness, compatibility
of the properties of the central entity and those related [11].
    A technique proposed by Iida and Tokunaga is also related to the theory of centering [12]. The
algorithm uses the output of the subsystem for coreference resolution. The authors propose to use the
developed metric as one of the features to build a model based on the matrix of entities and
demonstrate the improvement of the basic model, taking into account the existing anaphors, the
entities described in the text, and the relationships between them.
    Since many methods of evaluating and modelling the coherence of the text require some data on
which the model is trained or the generation algorithm is based, there is an issue of making these
methods independent of a specific topic or subject. Apart from that, the techniques discussed above
focus on certain aspects of coherence. There is a need to improve these algorithms by integrating the
underlying paradigms.

    Researchers Li and Jurafsky have considered methods of such improvement and proposed a
method for evaluating coherence and generating coherent text [13]. The proposed discriminative
model is based on deep learning methods. A set of consecutive sentences is an input. The central
sentence is considered together with the sentences surrounding it. Vector representation of each
sentence is obtained as an output of the LSTM network. The vectors corresponding to the sentences
are concatenated. Another neural network, the upper layer of which has a sigmoid activation function,
is used as a classifier to determine the probability that the input set of sentences is coherent. The
generative model is proposed in two variants: a modification of the basic sequence-to-sequence model
and a model based on the hidden Markov model and the LDA (Latent Dirichlet Allocation). The
training sample includes articles from Wikipedia.
    Basile et al. have defined a metric of connection between individual frames [14]. Some results of
this study were used in the development of the method [1], which has been applied in this work.
    In addition to logical and semantic coherence, there is also an important aspect of thematic
homogeneity and coherence. There are studies on this issue, which are based on the use of LDA [15].
Blei et al. define this generative probability model as a three-level hierarchical Bayesian model. LDA
is used for classification tasks, modelling of text topics, etc. The developers of this model have tried
to take into account the limitations that approaches based on TF-IDF (term frequency – inverse
document frequency) and LSI (Latent Semantic Indexing) have [15].
    The definition of coherence may also depend on the task for which the text is generated. For
example, for assessing the quality of dialogue systems, the specifics of this task must be considered.
Coherence metrics for dialogue should reflect these features and assess the general characteristics of a
logically constructed text (integrity within a single generated response, etc.).

4. Existing methods of bot detection
    Researchers from OpenAI in their report “Release Strategies and the Social Impacts of Language
Models” [16] classify the existing models of identification of automatically generated text into:
        those that are based on classical machine learning methods and learn from samples of pairs of
generated and corresponding real texts written by humans (for example, when a part of the human-written
text is used as input for the generation model);
        zero-shot classifiers that use pre-trained generative models, such as the GPT family or GROVER,
and are applied to texts generated by the same or a similar model to estimate the
probability that a given fragment was generated by this particular model. The model is not
trained additionally. An example of such a classification system is GLTR [17];
        classifiers based on model fine-tuning, for which the language model is trained and “learns”
to recognize itself in different configurations, with different values of hyperparameters [16].
    A semi-automatic method of verifying that the text is machine-generated was proposed by the
developers of the GLTR system [17]. GLTR is a tool designed to help a person determine whether the
text has been generated automatically. The results of the experiments show that the accuracy of
determining the machine-generated text by humans using GLTR increases from 54% to 72% [17].
When generating text sequentially word by word, the most common techniques for choosing the next
word from the most likely options are max sampling, k-max sampling and beam search. The probabilities of each
word depend on the left context. GLTR visualizes outliers and artifacts that may indicate automatic
text generation. BERT and GPT-2 are considered as models, the outputs of which can potentially get
to the GLTR input [17]. Some further developments are based on this idea of checking the
consistency of the distribution. In particular, similar models have been developed for automatic
identification of machine-generated text for RoBERTa, XLNet and other models.
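    The idea of checking how likely each word is given its left context can be illustrated with a toy
example. The sketch below is not the GLTR implementation: a bigram counting model built from a tiny
invented reference text stands in for a large language model such as GPT-2, and the names
train_bigram_model, token_ranks and top_k_share are hypothetical. For every position it reports the
rank of the observed word among the continuations the model considers likely.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count how often each word follows each left-context word."""
    model = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        model[prev][nxt] += 1
    return model


def token_ranks(model, tokens):
    """Rank of every observed word among the model's predictions.

    Rank 1 means the word was the most likely continuation of its left
    context; None means the model never saw this continuation.
    """
    ranks = []
    for prev, actual in zip(tokens, tokens[1:]):
        candidates = [word for word, _ in model[prev].most_common()]
        ranks.append(candidates.index(actual) + 1 if actual in candidates else None)
    return ranks


def top_k_share(ranks, k=1):
    """Share of positions whose observed word is within the top-k predictions."""
    if not ranks:
        return 0.0
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


# Tiny invented reference text used to estimate the stand-in model.
reference = "the cat sat on the mat the cat ate the fish the cat sat down".split()
model = train_bigram_model(reference)

predictable = "the cat sat".split()
surprising = "the fish sat".split()
print(token_ranks(model, predictable), top_k_share(token_ranks(model, predictable)))
print(token_ranks(model, surprising), top_k_share(token_ranks(model, surprising)))
```

A text in which almost every word has a very high rank is consistent with sampling from the head of
the model's distribution, which is the kind of regularity that a visualization of ranks makes
visible to a human reviewer.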
    Bao et al. use their own method of determining the coherence of natural language text to detect
machine-generated spam [18]. The training set for their experiment was formed from the news articles
in Chinese and English. The machine-generated part of the dataset consists of texts that were obtained
by automatic translation, summarization of texts, permutations of random words and sentences in the
original texts [18]. To model coherence, the researchers use pre-trained Bi-GRU. The vectors of the
internal states that were obtained after each of the “passes” through the sequence are used as features
for classification. They are fed to a convolutional neural network, the last layer of which is for binary
classification [18].
    The task of identifying machine-generated text can be reduced to the task of authorship
attribution. For example, consider the task of identifying a set of “bot” accounts that publish text
generated by one model in social networks. If the text was generated by a model whose parameters
did not change during generation, we can assume that the use of algorithms for determining
authorship is appropriate.
    Uchendu et al. consider the task of machine-generated text identification from the standpoint of
determining the authorship of the text [19]. The authors of the work “Authorship Attribution for
Neural Text Generation” consider the following tasks:
        given two texts, determine whether they have been generated by the same model;
        establish whether the given text was written by a person or generated automatically;
        for a given text and a set of generation methods, determine which one has been used [19].
    The authors conduct experiments with texts that have been generated using models CTRL, GPT-2,
GROVER, XLM, XLNET, PPLM, FAIR and others. According to their results, the texts produced by GPT-2,
GROVER and FAIR are the most difficult to identify and to distinguish from the
text written by people [19]. To solve the three problems, researchers have developed several common
basic architectures. The first option is to represent each word as a vector of 300 elements, summing
these vectors and applying a layer with softmax activation. The second model consists of a GRU
layer, to the output of which softmax is also applied. The third variant consists of a sequence of
convolutional layers, there are also variants with parallel convolutional layers and a combination of
RNN and CNN layers [19]. Each of these models is adapted to solve the above three problems.
    Zhong et al. emphasize that most methods of identifying machine-generated text do not include
mechanisms for analyzing the actual structure of the text, which is a determining factor in
distinguishing between generated and human-written texts. Researchers propose a graph model in which
the text is presented in
the format of a graph of entities [20]. A graph neural network is used to create a graph representation.
Then such representations of sentences are composed into a single representation of the whole text. In
addition, the relationships between adjacent consecutive sentences are modelled.
    Tay et al. studied the “artifacts” that the texts generated using different methods have and the
influence of factors such as decoding method, model size, input length. They consider a task of
identifying automatically generated text. The authors set a goal of better understanding of
fundamental properties of neural models designed to generate text [21]. The study of artifacts that are
present in the machine-generated text is a new and important area of research. The main conclusion of
the work, supported experimentally, is that such artifacts exist and that different
modelling configurations can be identified by using only the text that was generated [21]. This suggests that
text generators may be more sensitive to different modelling options than previously thought. The
results of this work allow applying the classical methods of classification, using the artifacts
discovered by the authors' method as features. In addition, researchers conclude that such
artifacts usually relate more to the lexical features of the text than to the syntactic structure.
    The aforementioned works of Tay et al. [21], Bao et al. [18], and Uchendu et al. [19] are
consistent with one another and rest on similar assumptions. Bakhtin et al. investigated energy-based
models and their application to the problem of text generation [22]. They worked on a method for
improving the quality of such models by automating their self-tuning during generation. To
generalize their method, they considered the problem of identifying machine-generated text and the
influence of model configurations on the difficulty of determining that a text was machine-generated.
    Classical machine learning methods that are effective for classification problems, such as
support vector machines, logistic regression, and the naive Bayes classifier, can also be used for
the bot detection task. The GPT-2 developers show that a baseline combining TF-IDF vectorization with
logistic regression achieves high results for smaller versions of the GPT-2 model [16]. This baseline
is quite difficult to surpass, but it does not take the quality of the text into account.
    The advantage of applying the coherence evaluation method proposed in [1] to the problem of
identifying automatically generated text lies in its adaptability: it evaluates general
characteristics of the text and is therefore generalizable. Coherence is a key indicator of the
quality of a text, its semantic correctness, and its ease of perception by humans; it reflects the
correctness of the main idea, the integrity and consistency of presentation, and the applicability of
the text to solving human problems.
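
    As a rough illustration of the TF-IDF and logistic regression baseline mentioned above, the
following minimal sketch implements both steps in plain Python. It is not the implementation used in
[16]: the whitespace tokenizer, the smoothed IDF variant, the learning rate, the number of epochs, and
the toy texts are all assumptions made for this example.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Vector = Dict[str, float]


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase whitespace tokenization is enough for a toy example.
    return text.lower().split()


def fit_idf(docs: Sequence[str]) -> Vector:
    """Smoothed inverse document frequency for every token seen in docs."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def tfidf(doc: str, idf: Vector) -> Vector:
    """L2-normalized TF-IDF vector; tokens unseen during fitting are ignored."""
    counts = Counter(tok for tok in tokenize(doc) if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}


def score(vec: Vector, weights: Vector, bias: float) -> float:
    """Predicted probability that the text is machine-generated."""
    z = bias + sum(weights.get(tok, 0.0) * v for tok, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(vectors: Sequence[Vector], labels: Sequence[int],
                 epochs: int = 200, lr: float = 0.5) -> Tuple[Vector, float]:
    """Logistic regression fitted with plain stochastic gradient descent."""
    weights: Vector = {}
    bias = 0.0
    for _ in range(epochs):
        for vec, label in zip(vectors, labels):
            error = score(vec, weights, bias) - label
            bias -= lr * error
            for tok, v in vec.items():
                weights[tok] = weights.get(tok, 0.0) - lr * error * v
    return weights, bias


# Toy data invented for illustration: 0 = human-written, 1 = machine-generated.
train_texts = [
    "i walked home and cooked dinner",
    "we visited my sister last weekend",
    "the the model model generates generates text",
    "text text generates the model the",
]
train_labels = [0, 0, 1, 1]

idf = fit_idf(train_texts)
weights, bias = train_logreg([tfidf(t, idf) for t in train_texts], train_labels)


def predict(text: str) -> int:
    """Return 1 if the text is classified as machine-generated, else 0."""
    return int(score(tfidf(text, idf), weights, bias) >= 0.5)
```

    Such a classifier only sees word frequencies, which is why, as noted above, it cannot account for
the quality of the text.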


5. Available datasets
    One of the most well-known and effective families of natural language text generation models is
GPT. These models, especially GPT-3, can generate texts that are very similar to those written by
humans, which makes methods for detecting their output especially interesting and important. To train
a detection model, it is important to have access to datasets of generated texts and, preferably, of
the human-written texts on which the generators have been trained. OpenAI, the company that develops
GPT, has provided access to a set of documents from the WebText dataset, on which GPT-2 has been
trained, and, for each such text, versions generated by the GPT-2 model with different settings [16].
This dataset has been used in this work. There is also a dataset for which Solaiman et al. fine-tuned
the GPT-2 model on the Amazon Reviews dataset [16]. The initial goal was to “teach” GPT-2 to generate
more natural reviews and comments. The difficulty with this dataset is that such comments are short
and often lack significant context. Most of the bot detection models analyzed in the previous section
show their worst results on short texts.
    Another example of a dataset for training models that identify automatically generated text is
the combination of the RealNews dataset and a set of texts generated by the GROVER model. In the
original study, this model is trained on RealNews, which consists of news articles. The model
developers separately provide the part of this dataset that was not used in training, as well as
examples of texts produced by the model with certain configurations [23].
    Uchendu et al. have compiled a dataset of human-written news articles on politics [19]. Parts of
these articles are used as input for eight language models, including GPT-2, GROVER, XLM, XLNet,
PPLM, and FAIR. Apart from that, a dataset of Twitter messages generated by bots using different
language models (based on Markov chains, LSTM, basic RNN, GPT-2, and others) is available [24]. This
dataset is hard to use because the texts are short. Importantly, however, these texts are real
outputs of social network bots, so studying such data can give a better understanding of how
automatically generated texts can be identified, and evaluating already trained models on this
dataset can show the applicability of the developed methods.

6. Method of bot detection based on the coherence metric
    The task of bot detection is formulated in the following way.
    Let $T$ be the considered input text, $T = \{s_i\}$, $i = 1, \dots, N$, where $N$ is the number
of sentences in the input text. Given only $T$ and no additional supporting data, such as the
publishing resource or the settings of the account from which the text has been published, determine
which class the text belongs to: automatically generated or human-written.
    A metric of coherence presented in [1] is used for defining a set of features for classification. This
metric is defined as
\[
\mathrm{Coherence}(T) = \frac{\sum_{i=1}^{n-1} \mathrm{FRel}(s_i, s_{i+1})}{n-1},
\]
    where $T$ is the considered text, $s_i$ is the $i$-th sentence of the text, and $n$ is the number
of sentences in $T$ (denoted $N$ above).
    This metric is built on the $\mathrm{FRel}$ metric, which is defined as follows:
\[
\mathrm{FRel}(S_1, S_2) = \alpha \times \mathrm{FRelPred}(S_1, S_2) + (1 - \alpha) \times \mathrm{FRelArgs}(S_1, S_2),
\]
    where $\alpha$ is a balancing coefficient. According to Basile et al. [14], the optimal value of
$\alpha$ is 0.5.
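
    The two definitions above can be sketched in a few lines of Python. This is only an illustration
under stated assumptions: the two components of FRel are passed in as functions, and the stand-ins
`toy_pred` and `toy_args` are hypothetical placeholders invented here to exercise the code, not the
corpus-based measures of [1].

```python
from typing import Callable, Sequence

PairScore = Callable[[str, str], float]


def make_frel(frel_pred: PairScore, frel_args: PairScore, alpha: float = 0.5) -> PairScore:
    """FRel(S1, S2) = alpha * FRelPred(S1, S2) + (1 - alpha) * FRelArgs(S1, S2)."""
    def frel(s1: str, s2: str) -> float:
        return alpha * frel_pred(s1, s2) + (1.0 - alpha) * frel_args(s1, s2)
    return frel


def coherence(sentences: Sequence[str], frel: PairScore) -> float:
    """Average FRel over the n - 1 pairs of consecutive sentences."""
    n = len(sentences)
    if n < 2:
        # Assumption: the source does not define the metric for a single sentence.
        raise ValueError("coherence needs at least two sentences")
    return sum(frel(sentences[i], sentences[i + 1]) for i in range(n - 1)) / (n - 1)


def toy_pred(s1: str, s2: str) -> float:
    """Hypothetical stand-in: 1.0 if both sentences start with the same word."""
    return 1.0 if s1.split()[0] == s2.split()[0] else 0.0


def toy_args(s1: str, s2: str) -> float:
    """Hypothetical stand-in: Jaccard overlap of the two word sets."""
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b)


text = ["dogs bark loudly", "dogs chase cats", "cats sleep"]
value = coherence(text, make_frel(toy_pred, toy_args, alpha=0.5))
```

    Keeping the pairwise measure as a parameter makes it possible to swap in the real components once
they are trained on a corpus.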
    The value of this metric depends on the values of its two components. The numeric representation
of relatedness of predicates is calculated as
\[
\mathrm{FRelPred}(S_1, S_2) = \log_2 \frac{|C_{p_1 p_2}|}{|C_{p_1}|\,|C_{p_2}|},
\]
    where $C_{p_1}$ and $C_{p_2}$ are the subsets of sentences from the corpus that share the
predicate $p_1$ with the first sentence $S_1$ and the predicate $p_2$ with the second sentence $S_2$,
respectively, and $C_{p_1 p_2}$ is the subset of pairs of adjacent sentences in the corpus in which
$p_1$ and $p_2$ are the main verb predicates of the first and the second sentence, respectively [1].
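
    A minimal sketch of this computation is given below. It assumes that the main predicate of every
corpus sentence has already been extracted, so that each document is simply a list of predicates in
sentence order; the tiny corpus is invented for illustration. The source does not say how predicate
pairs that never occur in the corpus are treated, so returning negative infinity for them is an
assumption of this sketch.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Tuple


def count_predicates(corpus: Iterable[Sequence[str]]) -> Tuple[Counter, Counter]:
    """Count sentences per predicate and adjacent sentence pairs per predicate pair."""
    single: Counter = Counter()
    pairs: Counter = Counter()
    for doc in corpus:
        single.update(doc)                # |C_p|: sentences whose main predicate is p
        pairs.update(zip(doc, doc[1:]))   # |C_p1p2|: adjacent sentences with (p1, p2)
    return single, pairs


def frel_pred(p1: str, p2: str, single: Counter, pairs: Counter) -> float:
    """log2(|C_p1p2| / (|C_p1| * |C_p2|)), computed on raw counts as in the formula."""
    joint = pairs[(p1, p2)]
    if joint == 0:
        # Assumption: unseen pairs get -inf (the logarithm of zero), with no smoothing.
        return float("-inf")
    return math.log2(joint / (single[p1] * single[p2]))


# Toy corpus: each inner list holds the main predicates of one story, in order.
toy_corpus = [
    ["wake", "eat", "leave"],
    ["wake", "eat"],
    ["eat", "sleep"],
]
single_counts, pair_counts = count_predicates(toy_corpus)
wake_eat = frel_pred("wake", "eat", single_counts, pair_counts)
```

    Here "wake" is followed by "eat" in two adjacent pairs, "wake" occurs in two sentences and "eat"
in three, so the value is $\log_2(2 / (2 \cdot 3))$.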

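    The second component is the relatedness between predicate arguments:
\[
\mathrm{FRelArgs}(S_1, S_2) = \frac{1}{2}\left( \frac{1}{|\mathrm{args}_1|} \sum_{N_i \in \mathrm{args}_1} \max_{N_j \in \mathrm{args}_2} \mathrm{wpsim}(N_i, N_j) + \frac{1}{|\mathrm{args}_2|} \sum_{N_i \in \mathrm{args}_2} \max_{N_j \in \mathrm{args}_1} \mathrm{wpsim}(N_i, N_j) \right),
\]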

    where $\mathrm{wpsim}$ is the Wu-Palmer similarity, and $\mathrm{args}_1$ and $\mathrm{args}_2$
are the sets of noun arguments of the predicates $p_1$ and $p_2$, respectively.
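
    The symmetric averaging in this formula can be illustrated as follows. The sketch replaces
WordNet with a small hypothetical taxonomy (`PARENT`) so that it stays self-contained, and it uses
the common form of the Wu-Palmer similarity, 2·depth(lcs) / (depth(a) + depth(b)), with the root at
depth 1; both choices, and the handling of empty argument sets, are assumptions of this example.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical toy taxonomy (child -> parent); a real system would query WordNet.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "food": "entity",
    "bread": "food",
}


def path_to_root(node: Optional[str]) -> List[str]:
    """Nodes from the given node up to the root, the node itself first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node: str) -> int:
    return len(path_to_root(node))  # the root has depth 1


def wup_similarity(a: str, b: str) -> float:
    """Wu-Palmer similarity on the toy taxonomy."""
    ancestors_a = set(path_to_root(a))
    # The first ancestor of b that is also an ancestor of a is the deepest common one.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


def directed(args_from: Sequence[str], args_to: Sequence[str],
             sim: Callable[[str, str], float]) -> float:
    """Average, over args_from, of the best similarity to any noun in args_to."""
    return sum(max(sim(ni, nj) for nj in args_to) for ni in args_from) / len(args_from)


def frel_args(args1: Sequence[str], args2: Sequence[str],
              sim: Callable[[str, str], float] = wup_similarity) -> float:
    """Half the sum of the two directed averages, as in the formula above."""
    if not args1 or not args2:
        # Assumption: the source does not cover empty argument sets; return 0.
        return 0.0
    return 0.5 * (directed(args1, args2, sim) + directed(args2, args1, sim))


example = frel_args(["dog", "bread"], ["cat"])
```

    Taking the maximum in both directions keeps the measure symmetric even when the two sentences
have different numbers of noun arguments.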
    The WebText dataset, together with outputs of GPT-2, which has been trained on the WebText data,
is used for training, and a separate subset is used for testing. Both datasets have been released by
OpenAI and are publicly available [16]. Researchers from OpenAI also mention differences in the
frequencies of different parts of speech between texts scraped from the web and those generated by
different models. After our experiments and hyperparameter tuning, the best results with the defined
features have been achieved by the XGBoost model [25].
    The considered features include the maximum, minimum, and average values of the coherence metric
[1] over all pairs of consecutive sentences in a text, as well as the numbers of nouns, adjectives,
adpositions, and verbs. The choice of features is explained as follows. The average value of the
coherence metric is an indicator of global connectivity, that is, the overall consistency of all
parts of the text. A low minimum value of the metric may indicate a semantic gap in the text, such as
a change of subtopic or an abrupt and inappropriate change of topic; it also shows inconsistency
between different parts of the text. A high maximum value reflects the quality of the text, the
consistency of its components, and ease of perception. The distributions of nouns, adjectives, verbs,
and prepositions in automatically generated and human-written texts (for the WebText and GPT-2
output datasets used here) are compared in Figure 2. A feature importance chart is given below in
Figure 3.
    XGBoost is based on gradient boosting of decision trees. Feature importance for this algorithm
depends on how often a feature has been chosen as the one to make a split on. Figure 4 shows an
example of a tree built as one of the estimators during the execution of the algorithm (a reduced
version of the tree is given for illustration). The model is flexible and can be adjusted to a
specific task by setting hyperparameters such as the learning rate and the number of trees,
restricting the branching criterion and the maximum depth of trees, and tuning other optimization
parameters. The model described above has been implemented in Python (version 3.6).
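
    The feature set described above can be assembled as in the following sketch. It assumes that the
relatedness scores of consecutive sentence pairs and the part-of-speech tags of the tokens have
already been computed elsewhere; the feature names and the use of Universal POS tag labels (`NOUN`,
`ADJ`, `ADP`, `VERB`) are assumptions of this example, and the classifier itself is not shown.

```python
from collections import Counter
from statistics import mean
from typing import Dict, Sequence, Tuple

# Assumed tag names (Universal POS tags) for the four counted parts of speech.
POS_FEATURES = ("NOUN", "ADJ", "ADP", "VERB")


def text_features(pair_scores: Sequence[float],
                  tagged_tokens: Sequence[Tuple[str, str]]) -> Dict[str, float]:
    """Build the feature vector of one text as a name -> value mapping.

    pair_scores   -- metric values for consecutive sentence pairs
    tagged_tokens -- (token, POS tag) pairs for the whole text
    """
    if not pair_scores:
        raise ValueError("at least one pair of consecutive sentences is required")
    pos_counts = Counter(tag for _, tag in tagged_tokens)
    features = {
        "coh_max": max(pair_scores),    # best-connected pair of sentences
        "coh_min": min(pair_scores),    # a low value hints at a semantic gap
        "coh_mean": mean(pair_scores),  # overall consistency of the text
    }
    for tag in POS_FEATURES:
        features["n_" + tag.lower()] = float(pos_counts[tag])
    return features


# Toy input invented for illustration.
scores = [0.2, 0.8, 0.5]
tokens = [("The", "DET"), ("dog", "NOUN"), ("chased", "VERB"), ("a", "DET"),
          ("small", "ADJ"), ("cat", "NOUN"), ("in", "ADP"), ("town", "NOUN")]
features = text_features(scores, tokens)
```

    A mapping of this kind, computed for every text, can then be turned into a numeric matrix and
passed to a gradient boosting classifier.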
    WordNet has been used along with the Stanford parser, which is implemented as a part of the
Stanza library [26]. The ROCStories corpus [27] has been used as the basic dataset for training the
coherence metric. The ROCStories dataset consists of five-sentence stories. Each story describes an
everyday situation in people's lives; the texts contain common words and simple sentences. This
dataset has been proposed for evaluating natural language understanding systems and for comparing
systems in the Story Cloze Test [27]. The model achieves the results shown in Table 1.

Table 1
Results
                                  Metric                         Result
                                 Accuracy                         0.7
                                 F1 score                         0.67
                                  Recall                         0.613
                                 Precision                        0.74
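
    The metrics in Table 1 are related in the usual way: the F1 score is the harmonic mean of
precision and recall. The short sketch below recomputes F1 from the reported precision and recall as
a consistency check; the confusion counts in the second example are invented for illustration and
are not taken from the experiments.

```python
from typing import Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Metrics from confusion counts; assumes tp + fp > 0 and tp + fn > 0."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1_score(precision, recall)


# Consistency check with the values reported in Table 1.
reported_f1 = round(f1_score(0.74, 0.613), 2)

# Toy confusion counts (hypothetical): 3 true positives, 1 false positive, 2 false negatives.
toy_precision, toy_recall, toy_f1 = precision_recall_f1(tp=3, fp=1, fn=2)
```

    With precision 0.74 and recall 0.613, the recomputed F1 rounds to 0.67, which agrees with the
table.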

7. Possible improvements and discussion
   The described model for bot identification can be extended: new characteristics of the text can
be added as features for classification. An advantage of the coherence metric is that it can be used
for texts written in different languages; this requires an ontology, a syntax parser, and a training
corpus of plain texts. It is worth mentioning that all of these tools are available for Ukrainian,
namely UkrWordNet [28], POS taggers, and tools for syntax analysis; different corpora are available
and are being developed [29]. This way, it is possible to develop the described system for Ukrainian
by retraining the existing model. The developed bot detection model shows that this metric [1] is
easily adaptable and can be used for different tasks that require analyzing the semantic properties
of a text and its quality. The study of identification methods can serve as a basis for improving
methods of natural language text generation.




Figure 2: Distribution of nouns, adjectives, verbs and prepositions for human-written and
automatically generated texts




Figure 3: Feature importance based on gain characteristic

Figure 4: An example of a tree that is built as one of the estimators

   By understanding what makes a text similar to one written by a human, a higher quality of
automatic natural language text generation can be achieved. The process of text generation by humans
is complex and not yet fully understood; the same is true of the process of perceiving texts.
Different researchers agree that coherence is an important aspect of ease of perception. That is why,
to develop efficient natural language generation systems, it is necessary to study coherence first
and to include mechanisms for maximizing coherence in generation architectures.

8. References
[1] O. O. Marchenko, O. S. Radyvonenko, T. S. Ignatova, et al., Improving Text Generation
    Through Introducing Coherence Metrics, Cybernetics and Systems Analysis 56 (2020) 13–21.
    doi:10.1007/s10559-020-00216-x.
[2] E. Reiter, R. Dale, Building applied natural language generation systems, Natural Language
    Engineering 3 (1997) 57–87. doi:10.1017/S1351324997001502.
[3] C. Kiddon, L. Zettlemoyer, Y. Choi, Globally Coherent Text Generation with Neural Checklist Models,
    in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.
[4] W. S. Cho, P. Zhang, Y. Zhang, X. Li, M. Galley, C. Brockett, M. Wang, J. Gao, Towards
    Coherent and Cohesive Long-form Text Generation, in: Proceedings of the First Workshop on
    Narrative Understanding, 2019, pp. 1–11. doi: 10.18653/v1/W19-2401.
[5] Coherent, Cambridge dictionary. URL: https://dictionary.cambridge.org/dictionary/english/coherent.
[6] I. Sevbo, On the study of the structure of the coherent text, in: Linguistic research on the general
    and Slavic typology, Science, 1966.
[7] M. Lapata, R. Barzilay, Automatic Evaluation of Text Coherence: Models and Representations,
    in: Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence,
    Edinburgh, Scotland, UK, 2005.

[8] D. Lin, LaTaT: Language and text analysis tools, in: Proceedings of the 1st International
     Conference on Human Language Technology Research, 2001, pp. 222–227.
[9] T. Briscoe, J. Carroll, Robust accurate statistical annotation of general text, in: Proceedings of
     the 3rd International Conference on Language Resources and Evaluation, 2002, pp. 1499–1504.
[10] N. McIntyre, Learning to Tell Tales: Automatic Story Generation from Corpora, Ph.D. thesis,
     University of Edinburgh, 2011.
[11] B. Grosz, A. Joshi, S. Weinstein, Centering: A framework for modeling the local coherence of
     discourse, Computational Linguistics 21 (1995) 203–225.
[12] R. Iida, T. Tokunaga, Metric for Evaluating Discourse Coherence based on Coreference
     Resolution, in: Proceedings of COLING, 2012, pp. 483–494.
[13] J. Li, D. Jurafsky, Neural net models of open-domain discourse coherence, in: Proceedings of
     Conference on Empirical Methods in Natural Language Processing, 2017, pp. 198–209.
[14] V. Basile, R. L. Condori, E. Cabrio, Measuring Frame Instance Relatedness, in: Proceedings of
     the 7th Joint Conference on Lexical and Computational Semantics, 2018, pp. 245–254.
     doi: 10.18653/v1/S18-2029.
[15] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning
     Research 3 (2003) 993–1022.
[16] I. Solaiman, M. Brundage, J. Clark, A. Askell, A. Herbert-Voss, J. Wu, A. Radford, J. Wang,
     Release Strategies and the Social Impacts of Language Models, 2019. URL:
     https://arxiv.org/abs/1908.09203.
[17] S. Gehrmann, H. Strobelt, A. Rush, GLTR: Statistical Detection and Visualization of Generated
     Text, in: Proceedings of the 57th Annual Meeting of the Association for Computational
     Linguistics: System Demonstrations, 2019, pp. 111–116.
[18] M. Bao, J. Li, J. Zhang, H. Peng, X. Liu, Learning Semantic Coherence for Machine Generated
     Spam Text Detection, in: Proceedings of the International Joint Conference on Neural Networks,
     2019. doi:10.1109/IJCNN.2019.8852340.
[19] A. Uchendu, T. Le, K. Shu, D. Lee, Authorship Attribution for Neural Text Generation, in:
     Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020.
[20] W. Zhong, D. Tang, Z. Xu, R. Wang, N. Duan, M. Zhou, J. Wang, J. Yin, Neural Deepfake Detection
     with Factual Structure of Text, in: Proceedings of the 2020 Conference on Empirical Methods in
     Natural Language Processing, 2020, pp. 2461–2470. doi: 10.18653/v1/2020.emnlp-main.193.
[21] Y. Tay, D. Bahri, C. Zheng, C. Brunk, D. Metzler, A. Tomkins, Reverse Engineering Configurations
     of Neural Text Generation Models, in: Proceedings of the 58th Annual Meeting of the Association
     for Computational Linguistics, 2020, pp. 275–279. doi: 10.18653/v1/2020.acl-main.25.
[22] A. Bakhtin, S. Gross, M. Ott, Y. Deng, M. Ranzato, A. Szlam, Real or fake? Learning to
     discriminate machine from human generated text, 2019. URL:
     https://arxiv.org/pdf/1906.03351.pdf.
[23] R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, Y. Choi, Defending against
     neural fake news, Advances in Neural Information Processing Systems 32 (2019) 9051–9062.
[24] T. Fagni, F. Falchi, M. Gambini, A. Martella, M. Tesconi, TweepFake: about Detecting
     Deepfake Tweets, 2021. URL: https://arxiv.org/abs/2008.00036.
[25] T. Chen, C. Guestrin, XGBoost: A Scalable Tree Boosting System, in: Proceedings of the 22nd
     ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp.
     785–794. doi: 10.1145/2939672.2939785.
[26] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, D. McClosky, The Stanford
     CoreNLP Natural Language Processing Toolkit, in: Proceedings of the 52nd Annual Meeting of
     the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55–60. doi:
     10.3115/v1/P14-5010.
[27] R. Sharma, J. F. Allen, O. Bakhshandeh, N. Mostafazadeh, Tackling the Story Ending Biases in
     The Story Cloze Test, in: Proceedings of the 56th Annual Meeting of the Association for
     Computational Linguistics, 2018, pp. 752–757.
[28] A. Anisimov, O. Marchenko, A. Nikonenko, E. Porkhun, V. Taranukha, Ukrainian WordNet:
     Creation and Filling, in: H. L. Larsen, M. J. Martin-Bautista, M. A. Vila, T. Andreasen, H.
     Christiansen (eds) Flexible Query Answering Systems. FQAS 2013. Lecture Notes in Computer
     Science, vol 8132, Springer, Berlin, Heidelberg. doi: 10.1007/978-3-642-40769-7_56.
[29] Lang Uk Project, URL: https://lang.org.ua.


